# PROJECT SETUP
# set DRAFT = False for final output; DRAFT = True is faster
DRAFT = False
import itertools
import logging
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings( "ignore", module = "plotnine\..*" )
import cmdstanpy as csp
csp.utils.get_logger().setLevel(logging.ERROR)
import numpy as np
import pandas as pd
import plotnine as pn
import patchworklib as pw
def mydraw(x):
"""draw plot in the quarto doc"""
x.draw()Getting Started with Stan
in Python with CmdStanPy and plotnine
Preface
Prerequisites
We will assume the reader will be able to follow text that includes basic notions from
- differential and integral calculus in multiple dimensions,
- matrix arithmetic (but not linear algebra),
- probability theory, including probability density and mass functions, cumulative distribution functions, expectations, events, and the basic rules of probability theory, and
- Python numerical programming with NumPy.
By basics, we really do mean basics. You won’t need to do any calculus, we will just use it to express what Stan computes. Similarly, we will use matrix notation to express models, and we will avoid advanced topics in linear algebra that are at the heart of some of Stan’s internals.
We include two appendices as both background and summary, with rigorous definitions of the mathematics used in this introduction. Those who are more mathematically inclined and want to brush up on probability theory and Bayesian statistics, may wish to start with the appendices.
- Appendix A: Probability theory
- Appendix B: Bayesian statistics
Source code and license
All of the source markdown, YAML, LaTeX, and BibTeX files are available in
- Source Repository: github.com/bob-carpenter/stan-getting-started
Everything is open source, with licenses:
- Code: BSD-3-Clause license
- Text: CC-BY 4.0
Python, CmdStanPy, NumPy, pandas, and plotnine
For scripting language, we use Python 3.
To access Stan, we use the Python package CmdStanPy.
For numerical and statistical computation in Python, we use NumPy. Although we do not use it here, SciPy extends NumPy with more advanced scientific computing and statistical algorithms.
For plotting, we use the Python package plotnine. plotnine is a Python reimplementation of ggplot2, which is itself an implementation of the grammar of graphics (Wilkinson 2005).
We use pandas for representing wide-form data frames, primarily because it is the required input for plotnine.
Python boilerplate
We include the following Python boilerplate to import and configure packages we will use throughout this tutorial.
Acknowledgements
I’d like to thank Mitzi Morris, Barry Smith, and Brian Ward for helpful feedback on earlier drafts. Thanks to ChatGPT Plus 4.0 (May 5 and May 12, 2023 versions) for extensive help on plotting and data frame construction and rewriting some passages I couldn’t seem to write clearly. I’d also like to thank the entire Stan Development Team and all of our users. Stan is an open project and a large team effort and we’re always happy to help onboard new developers, and we wouldn’t be doing it without users.
1 Introduction
These notes are intended to introduce several technical topics to practitioners: Bayesian statistics and probabilistic modeling, Markov chain Monte Carlo methods for Bayesian inference, and the Stan probabilistic programming language.
1.1 Bayesian statistics
The general problem addressed by statistical inference is that of reasoning from a limited number of noisy observations. For example, we might want to perform inference about a population after measuring a subset of its members, or we might want to predict future events after observing past events.
There are many approaches to applied statistics. These notes focus on Bayesian statistics, a form of statistical modeling and inference that is grounded in probability theory. In the Bayesian approach to statistics, we characterize our knowledge of the world in terms of probabilities (e.g., there is a 24.3% chance of rain after lunch today, the probability that the next baby born in the United states is male is 51%).
Bayesian inference is always carried out with respect to a mathematical model of a stochastic data generating process. If the model is well-specified in the sense of matching the true data generating process, then Bayesian statistical inference can be shown to have several desirable properties, such as calibration and resistance to overfitting.
Appendix B provides a short, but precise introduction to Bayesian inference, following Appendix A, which reviews the basics of probability theory. This establishes both the notation and provides more formal definitions of exactly what Stan is computing.
If you’re looking for a gentle introduction to Bayesian statistics, I highly recommend Statistical Rethinking (McElreath 2023). For a more advanced introduction, try Bayesian Data Analysis (Gelman et al. 2013), which is available from the authors as a free pdf.
1.2 Markov chain Monte Carlo methods
Bayesian inference for parameter estimation, prediction, or event probability estimation is based on posterior expectations. A posterior expectation is a high dimensional integral over the space of parameters. Stan adopts the standard approach to solving general high-dimensional integrals, which is the Monte Carlo method. Monte Carlo methods use random sampling (hence the name) to solve high-dimensional integrals (which is not itself a random quantity).
We cannot use standard Monte Carlo methods for most problems of interest in Bayesian statistics because we cannot generate a sample of independent draws from the posterior density of interest.1 The exception is simple models in the exponential family with conjugate priors (Diaconis and Ylvisaker 1979). So instead, we have to resort to Markov chain Monte Carlo (MCMC) methods (Brooks et al. 2011), which create samples with correlation structure among the draws making up the sample.
Alternatives to MCMC include rejection sampling (Gilks and Wild 1992), sequential Monte Carlo (Doucet, De Freitas, and Gordon 2001), approximate Bayesian computation (ABC) (Marin et al. 2012), variational inference (Blei, Kucukelbir, and McAuliffe 2017), and nested Laplace approximation (Rue, Martino, and Chopin 2009), among others.
Among MCMC methods, Stan adopts Hamiltonian Monte Carlo (HMC) (Neal 2011), which is currently the most efficient and scalable MCMC method for smooth target densities. Popular alternatives include random-walk Metropolis-Hastings (Chib and Greenberg 1995) and Gibbs sampling (Casella and George 1992), both of which are simpler, but much less efficient than HMC for all but the easiest of problems.
1.3 Stan and probabilistic programming
Stan is what is known as a domain specific language (DSL), meaning it was written for particular application. Specifically, Stan is a probabilistic programming language (PPL) designed for coding statistical models. Although Stan is used most widely for Bayesian inference, it can also perform standard frequentist inference (e.g., maximum likelihood, bootstrap, etc.), though we do not touch on those capabilities in this introduction.
A Stan program declares data variables and parameters, along with a differentiable posterior log density (of the parameters given the data) up to a constant. Although this is the only requirement, most Stan programs define the joint log density of the parameters and variables, which can be shown to be equal to the log posterior plus a constant.
Stan programs are probabilistic programs in the sense that their data and parameters can represent random variables. One way to think of a random variable in the context of Stan is as a variable that takes its value randomly (though in fact, it takes its value pseudorandomly based on a random seed). This means that by default, running a Stan program twice will give different answers. Stan programs can be made deterministic in order to provide reproducible results by fixing a random seed.
Stan programs are translated to C++ and we estimate quantities of interest using either MCMC or approximate methods like variational inference or Laplace approximation. Users provide Stan data and access Stan output through analytics languages like Python, R, or Julia.
2 Pragmatic Bayesian statistics
There have been several schools of Bayesian statisticians, including those identifying as subjective Bayesians and those identifying as objective Bayesians. These two paradigms have diametrically opposed philosophical approaches—the “subjective” approach use as much prior information as is consistent with their beliefs, whereas the “objective” approach uses as little prior information as possible. strong priors based on beliefs, whereas the “objective” approach tries to find the weakest possible priors. Both these groups then trust their inferences based on their chosen philosophical approach.
We are going to follow a more pragmatic approach to Bayesian statistics that views model building as more of an engineering discipline than a philosophical exercise. This perspective is laid out in detail in Gelman et al. (2013) and Gelman et al. (2020). The pragmatic approach feels more like modern machine learning than statistics, with its emphasis on predictive calibration (which is itself a frequentist notion when applied to future data) and its willingness to modify a statistical model based on its behavior on the data.
The fundamental distinguishing feature of frequentist statistics is that probabilities are understood as long-run frequencies of repeatable processes. This prohibits placing probability distributions over parameters, because there is no long-term repeatable process in the world generating new parameters. For example, the gravitational constant has a single value and is not the value of a potentially repeatable trial like a coin flip (other than in a philosophical, possible worlds sense).
In our pragmatic approach to Bayesian statistics, we treat probability as fundamentally epistemic rather than deontic, meaning it is about human knowledge, not about human belief. This is a subtle, but important distinction. Although frequentists sometimes worry that Bayesians are “putting their thumb on the scale” by including their prior knowledge in a model rather than “letting the data speak for itself,” this is an instance of the pot calling the kettle black. The biggest “subjective” decision in model building is shared between Bayesian and frequentist approaches, namely the likelihood assumed to model the data-generating process. In practice, we often sidestep the concerns of subjectivity by using weakly informative priors that indicate the scale, but not the particular value, of a prior. And we furthermore run sensitivity analyses to test the effect of our prior assumptions.
Pierre-Simon Laplace (1814) begins his book on probability by stating the general epistemic position on probability in terms of an entity that knows all (aka Laplace’s demon).
We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.
John Stuart Mill (1882) is more explicit in laying out the epistemic view of probability as follows.
We must remember that the probability of an event is not a quality of the event itself, but a mere name for the degree of ground which we, or some one else, have for expecting it. \(\ldots\) Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence.
3 Stan example: forward-simulating clinical trials
By forward simulation, we mean running a simulation of a scientific process forward from the parameter values to simulated data. For example, consider a simple clinical trial with \(N\) subjects and a probability \(\theta \in (0, 1)\) of a positive outcome. Given \(\theta\) and \(N\), we can simulate the number of patients \(y \in 0{:}N\) with a successful outcome according to a binomial distribution (which we define below).
The inverse problem is that of estimating the probability of success \(\theta,\) given an observation of \(y\) successes out of \(N\) subjects. For example, we might have \(N = 100\) subjects in the trial, \(y = 32\) of whom had a positive outcome from the trial. A simple estimate of \(\theta\) in this case would be 0.32. We return to estimation and uncertainty quantification in later sections.
Let’s say we have \(N = 100\) subjects in our clinical trial and the success rate is \(\theta = 0.3\). We can simulate a result \(y\) from the clinical trial by randomly generating the number of subjects with a successful outcome. Although this could be done by simulating the binary outcome for each patient, it wouldn’t be an efficient way to sample from a binomial distribution.
In statistical sampling notation, we write \[ Y \sim \textrm{binomial}(N, \theta) \] to indicate that there are \(N \in \mathbb{N}\) patients with probability \(\theta \in (0, 1)\) of a successful outcome, with \(Y \in 0{:}N\) representing the number of successful outcomes out of \(N\) patients.
The probability mass function function for \(Y\), written \(p_Y\), is defined for \(N \in \mathbb{N}\), \(\theta \in (0, 1)\), and \(y \in 0{:}N\) by \[\begin{align} p_Y(y \mid N, \theta) &= \textrm{binomial}(y \mid N, \theta) \\[6pt] &= \binom{N}{y} \cdot \theta^y \cdot (1 - \theta)^{N - y}. \end{align}\] Unless necessary for disambiguation, we will drop the random variable subscripts on probability density or mass functions like \(p_Y\) going forward, writing simply \(p(y \mid N, \theta)\) and allowing context to disambiguate.
3.1 A first Stan program
Let’s say we wanted to generate random instantiations of \(Y\) for given values of \(N\) and \(\theta\). We can do that using the following Stan program, which we will unpack line by line after its listing.
stan/binomial-rng.stan
data {
int<lower=0> N;
real<lower=0, upper=1> theta;
}
generated quantities {
int<lower=0, upper=N> y = binomial_rng(N, theta);
}The first thing to notice is that a Stan program is organized into blocks. Here we have two blocks, a data block containing declarations of variables that must be input as data, and a generated quantities block, which not only declares variables, but assigns a value to them. In the case of this Stan program, the generated quantity variable y is assigned the result of taking a single draw from a \(\textrm{binomial}(N, \theta)\) distribution, which Stan provides through the binomial_rng function.
The second thing to notice about a Stan program is that the variables are all declared with types. Stan uses static typing, which means that unlike Python or R, a variable’s type is declared in the program before it is used rather than determined at run time based on what is assigned to it. Once declared, a variable’s type never changes. Stan also uses strong typing, meaning that unlike C or C++, there is no way to get around the type restrictions and access memory directly.
The program declares three variables, N and y of type int (integer values in \(\mathbb{Z}\)), and theta of type real (real values in \(\mathbb{R}\)). On actual computers, our integers will have fixed upper and lower bounds and our real numbers are subject to all the vagaries of numerical floating point calculations. Stan uses double-precision (64-bit) floating point and follows the IEEE 754 standard other than in a few highly-optimized calculations that lose a few bits of precision.
A type may also have constraints. Because N is a count, it must be greater than or equal to zero, which we indicate with the bound lower=0. Similarly, the variable y is the number of successes out of N trials, so it must take on a value between 0 and N (inclusive); that is represented with the constraint lower=0, upper=N. Finally, the variable theta is real and declared to fall in the interval \([0, 1]\) with the constraint lower=0, upper=1. Technically, our bounds are open for real values, but in practice, we might wind up with 0 or 1 values due to underflow or rounding errors in floating point arithmetic.
At run time, the compiled Stan program must be given values for N and theta, at which point, each iteration it will sample a value of y using its built-in pseudorandom number generator. In code, we first define a dictionary for our data (variables N and theta), then construct an instance of CmdStanModel for our model from the path to its program, and finally sample from the model using the sample method of CmdStanModel.
N = 100
theta = 0.3
data = {'N': N, 'theta': theta}
model = csp.CmdStanModel(stan_file = '../stan/binomial-rng.stan')
sample = model.sample(data = data, seed = 123, chains = 1,
iter_sampling = 10, iter_warmup = 0,
show_progress = False, show_console = False)The Python boilerplate above set the log level to WARNING for the cmdstanpy package in order to get rid of the information-level messages that would otherwise provide updates on a running Stan program. Changing the log level to INFO provides more information as a Stan program runs, and level DEBUG provides even more.
The constructor for CmdStanModel is used to construct a model from a Stan program found in the specified file. We highly recommend using a standalone file for Stan programs to make them easy to share, to allow both quotes for printing and apostrophes for transposition, and to make it easy to find the lines referenced by number in error messages. Under the hood, this first runs Stan’s transpiler to convert the Stan program to a C++ class. Then it compiles the C++ program, which will take on the order of twenty seconds due to the heavy use of optimization and template metaprograms.
For the sample() method of the CmdStanModel object model, we provide arguments
data: the data read in the data block of the Stan program,seed: pseudorandom number generator for reproducibility,chains: the number of simulation runs (parallel_chainsindicates how many to run in parallel),iter_sampling: number of draws (i.e., sample size) to return,iter_warmup: number of warmup iterations to tune parameters of the sampling algorithm (not needed here, so set to 0),show_progress: ifTrue, print progress updates, andshow_console: pop up a GUI progress monitor.
The sample() of the compiled model class returns a sample consisting of the specified number of random draws (10 here). The way this works here is that when sample() is called, CmdStan runs Stan as a standalone C++ program in a background process. This program starts by copying he data to a file, then reading the data file in to construct a C++ object representing the model. Then for this program, with only a generated quantities block, for each draw (the number of which is specified by iter_sampling) Stan runs a pseudorandom number generator to generate a draw from the specified binomial distribution. Random number generation is done with respect to the seed value specified in the call. For more details on how pseudorandom number generation is performed; see the (free online) book by Devroye (1986) for more information. We describe the operational semantics of Stan in more detail in the section on Stan’s execution model below.
Once sampling has executed, we can extract the sample for the scalar variable y as an array and then print them along with our other variables.
y = sample.stan_variable('y')
print("N = ", N, "; theta = ", theta, "; y(0:10) =", *y.astype(int))N = 100 ; theta = 0.3 ; y(0:10) = 33 36 30 28 37 26 25 19 29 33
Let’s put that in a loop and see what it looks like for for N equal to 10, 100, 1000, and 10,000.
for N in [10, 100, 1_000, 10_000]:
data = {'N': N, 'theta': theta}
sample = model.sample(data = data, seed = 123, chains = 1,
iter_sampling = 10, iter_warmup = 0,
show_progress = False,
show_console = False)
y = sample.stan_variable('y')
print("N =", N)
print(" y: ", *y.astype(int))
print(" est. theta: ", *(y / N))N = 10
y: 4 4 3 4 3 3 3 5 2 4
est. theta: 0.4 0.4 0.3 0.4 0.3 0.3 0.3 0.5 0.2 0.4
N = 100
y: 33 36 30 28 37 26 25 19 29 33
est. theta: 0.33 0.36 0.3 0.28 0.37 0.26 0.25 0.19 0.29 0.33
N = 1000
y: 322 324 306 333 311 318 294 323 282 311
est. theta: 0.322 0.324 0.306 0.333 0.311 0.318 0.294 0.323 0.282 0.311
N = 10000
y: 3049 3052 3012 3025 3042 3087 3051 2922 2943 3025
est. theta: 0.3049 0.3052 0.3012 0.3025 0.3042 0.3087 0.3051 0.2922 0.2943 0.3025
On the first line for \(N = 10\) trials, our simple frequency-based estimates range from 0.2 to 0.5. By the time we have 10,000 trials, the frequency-based estimates only vary between 0.292 and 0.309. We know from the central limit theorem that the spread of estimates is expected to shrink at a rate of \(\mathcal{O}(1 / \sqrt{N})\) for \(N\) draws (this result is only asymptotic in \(N\), but is very close for large-ish \(N\) in practice).
It is hard to scan these results. Let’s repeat simulate 100,000 \(y\) values for each value of \(N\) and plot histograms. The following histogram plots the distribution of estimates based on 10, 100, and 1000 observations over 100,000 simulated trials.
np.random.seed(123)
ts = []
ps = []
theta = 0.3
M = 100 if DRAFT else 100_000
for N in [10, 100, 1_000]:
data = {'N': N, 'theta': theta}
sample = model.sample(data = data, seed = 123, chains = 1,
iter_sampling = M, iter_warmup = 0, show_progress = False,
show_console = False)
y = sample.stan_variable('y')
theta_hat = y / N
ps.extend(theta_hat)
ts.extend(itertools.repeat(N, M))
xlabel = 'estimated Pr[success]'
df = pd.DataFrame({xlabel: ps, 'trials': ts})
mydraw(pn.ggplot(df, pn.aes(x = xlabel))
+ pn.geom_histogram(binwidth=0.01)
+ pn.facet_grid('. ~ trials')
+ pn.scales.scale_x_continuous(limits = [0, 1], breaks = [0, 1/4, 1/2, 3/4, 1],
labels = ["0", "1/4", "1/2", "3/4", "1"], expand=[0, 0])
+ pn.scales.scale_y_continuous(expand=[0, 0, 0.05, 0])
+ pn.theme(aspect_ratio = 1, panel_spacing = 0.15,
strip_text = pn.element_text(size = 6),
strip_background = pn.element_rect(height=0.08,
fill = "lightgray")))Although the histograms have different heights and the first one is spiky, the key consideration here is that they all have the same area, representing the 100,000 simulated values of \(y\). The trial size of 10 only has 10 possible values, 0.0, 0.1, …, 1.0, so the histogram (technically a bar chart here) just shows the counts of those outcomes. Here, \(y = 3\) is the most prevalent result, with corresponding estimate for \(\theta\) of \(y / 10 = 0.3\). The trial size of 100 looks roughly normal, as it should as a binomial with trials \(N = 100\). By the time we get to \(N = 1,000\) trials, the draws for \(y\) concentrate near 300, or near the value of \(0.3\) for \(\theta\). As \(N\) grows, the central limit theorem tells us to expect that the width of these histograms to shrink at a rate of \(\mathcal{O}(1 / \sqrt{N})\).
3.2 Pseudorandom numbers
As a probabilistic programming language, Stan relies on random number generation. Because Stan runs on traditional digital computers (i.e., von Neumann machines), it cannot truly generate random numbers. Instead, it does the next best thing and uses a pseudorandom number generator (PRNG) to generate a sequence of numbers deterministically. Specifically, we set a random number generation seed, which when combined with a PRNG, generates a sequence of numbers deterministically that have many of the properties of truly random numbers. The (free online) book by Devroye (1986) is the definitive reference for pseudorandom number generation for statistical distributions and contains a general introduction to PRNGs.
We can see how random number generators work in Stan by running our sampling method with seeds 123, 19876, and 123.
for seed in [123, 19876, 123]:
sample = model.sample(data = data, seed = seed, chains = 1,
iter_sampling = 10, iter_warmup = 0,
show_progress = False, show_console = False)
print(f"{seed = }; sample = {np.asarray(sample.stan_variable('y'), dtype=int)}") seed = 123; sample = [322 324 306 333 311 318 294 323 282 311]
seed = 19876; sample = [286 303 286 309 289 276 317 298 328 294]
seed = 123; sample = [322 324 306 333 311 318 294 323 282 311]
The two runs with seed 123 produce the same results. The code to extract values of y is clunky because the stan_variable() method always returns a floating point type and we have converted it to an integer-valued NumPy array.
Generally, we want our Stan programs to replicate similar results with different random seeds. Substantially different results from different seeds is a red flag that there is something wrong with the combination of model and data.
3.3 Monte Carlo integration
Bayesian computation relies on averaging over our uncertainty in estimating parameters. In general, it involves computing expectations, which are weighted averages with weights given by densities. In this section, we will introduce Monte Carlo methods for calculating a simple integral corresponding to the expectation of a discrete indicator variable. We’ll use the textbook example of throwing darts at a board randomly and using the random locations to estimate the mathematical constant \(\pi\).
We start with a two-unit square centered at the origin. Then we will generate points uniformly at random in this square. For each point \((x, y)\), we will calculate whether it falls inside the unit circle circumscribed within the square, which is true if the distance to the origin is less than or equal to 1, \[ \sqrt{x^2 + y^2} \leq 1, \] which simplifies by squaring both sides to \[ x^2 + y^2 \leq 1. \] The proportion of such points gives the proportion of the square’s volume taken up by the circle. Because the square is \(2 \times 2\), it has an area of 4, so the circle has an area of 4 times the proportion of points falling inside the circle (i.e., in the unit disc).
Here’s the Stan code.
stan/monte-carlo-pi.stan
generated quantities {
real<lower=-1, upper=1> x = uniform_rng(-1, 1);
real<lower=-1, upper=1> y = uniform_rng(-1, 1);
int<lower=0, upper=1> inside = x^2 + y^2 < 1;
real<lower=0, upper=4> pi = 4 * inside;
}The program declares variables x and y and constrains them to fall in the interval \([-1, 1]\) and assigns them uniform random values. The indicator variable inside is set to 1 if the Euclidean length of the vector \(\begin{bmatrix}x & y\end{bmatrix}\) is less than 1 (i.e., it falls within an inscribed unit circle) and is set to 0 otherwise. The We then set a variable pi equal to 4 times the indicator value. The mean of the sample for pi will be our estimate of \(\pi\).
First, we compile and then sample from the model, taking a sample size of M = 10_000 draws. Then we plot the draws.
M = 100 if DRAFT else 10_000
model = csp.CmdStanModel(stan_file = '../stan/monte-carlo-pi.stan')
sample = model.sample(chains = 1, iter_warmup = 0, iter_sampling = M,
show_progress = False, show_console = False,
seed = 123)
x_draws = sample.stan_variable('x')
y_draws = sample.stan_variable('y')
inside_draws = [int(i) for i in sample.stan_variable('inside')]
pi_draws = sample.stan_variable('pi')
inside_named_draws = np.array(["out", "in"])[inside_draws]
df = pd.DataFrame({'x': x_draws, 'y': y_draws,
'inside': inside_named_draws})
mydraw(
pn.ggplot(df, pn.aes(x = 'x', y = 'y',
group='inside', color='inside'))
+ pn.geom_point(size = 0.1)
+ pn.labs(x = 'x', y = 'y')
+ pn.coord_fixed(ratio = 1)
)Next, we take the sample mean of the indicators for being inside the circle, which produces an estimate of the probability of a point being inside the circle. This corresponds directly to the expectation \[\begin{align} \mathbb{E}[4 \cdot \textrm{I}(\sqrt{X^2 + Y^2} \leq 1)] &= \int_{-1}^1 \int_{-1}^1 \textrm{I}(x^2 + y^2 \leq 1) \cdot p(x, y) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \int_{-1}^1 \int_{-1}^1 \textrm{I}(x^2 + y^2 \leq 1) \cdot \textrm{uniform}(x \mid -2, 2) \cdot \textrm{uniform}(y \mid -2, 2) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \int_{-1}^1 \int_{-1}^1 4 \cdot \textrm{I}(x^2 + y^2 \leq 1) \, \textrm{d}x \, \textrm{d}y \\[4pt] \pi. \end{align}\]
The posterior mean inside is the probability that a random point in the 2-unit square falls in the inscribed unit circle. The posterior mean for pi is thus our estimate for \(\pi\).
Pr_is_inside = np.mean(inside_draws)
pi_hat = np.mean(pi_draws)
print(f"Pr[Y is inside circle] = {Pr_is_inside:.3f};")
print(f"estimate for pi = {pi_hat:.3f}")Pr[Y is inside circle] = 0.786;
estimate for pi = 3.144
The true value of \(\pi\) to 3 digits of accuracy is \(3.142\), so we are close, but not exact, as is the nature of Monte Carlo methods. If we increase the number of draws, our error will go down. Theoretically, with enough draws, we can get any desired precision; in practice, we don’t have that long to wait and have to make do with only a few digits of accuracy in our Monte Carlo estimates. This is usually not a problem because statistical uncertainty still dominates our numerical imprecision in most applications; we discuss this important point later when contrasting estimation uncertainty and sampling uncertainty in the section on posterior predictive inference and when considering practical guidance on how long to run Stan.
3.4 Markov chain Monte Carlo methods
In the previous sections, we generated a sample of draws by taking a sequence of independent draws. We then just averaged results to get plug-in estimates for expectations.
In modern applied Bayesian modeling, it is almost never possible to generate independent draws from the posterior distribution and apply pure Monte Carlo methods. The only case where it is possible is within the highly restrictive exponential family of sampling distributions with conjugate priors (Diaconis and Ylvisaker 1979). Before the MCMC revolution of the 1990s, Bayesian inference was largely restricted to conjugate priors. Even if the entire model wasn’t purely conjugate, conjugate priors remained popular due to the use of Gibbs sampling for MCMC, either hand-built or coded in in the first probabilistic programming language, BUGS (Lunn et al. 2012).
The introduction of automatic differentiation opened up the possibility of coding the more efficient and scalable Hamiltonian Monte Carlo method in Stan. This has led to the adoption of more non-conjugate priors and general priors.
In Markov chain Monte Carlo methods, we base each draw on the previous draw. A sequence of random variables each of which depends only on the previous variable generated is called a Markov chain. That is, a sequence of random variables \(Y_1, Y_2, \ldots\) makes up a Markov chain if \[ p_{Y_{n+1} | Y_{1}, \ldots Y_N}(y_{n + 1} | y_1, \ldots, y_n) = p_{Y_{n+1} \mid Y_n}(y_{n+1} \mid y_n) \]
We can illustrate with a simple example of three Markov chains, all of which have a stationary distribution of \(\textrm{bernoulli}(0.5)\) and thus the long-term average of all chains should approach 0.5. We will introduce a parameter \(\theta \in (0, 1)\) and take the probabilities of element \(Y_{n+1}\) depending on the previous element \(Y_{n}\) to be \[\begin{align} \Pr[Y_{n + 1} &= 1 \mid Y_n = 1] = \theta \\ \Pr[Y_{n + 1} &= 1 \mid Y_n = 0] = 1 - \theta \end{align}\] The first line says that if the last number we generated is 1, the probability of the next element being 1 is \(\theta\). The second line says that if the last number we generated is 0, the probability of the next element being 1 is \(1 - \theta\), and thus the probability of the next element being 0 is \(\theta\). That is, there’s a probability of \(\theta\) of generating the same element as the last element.
Here is a Stan program that generates the first \(M\) entries of a Markov chain over outputs 0 and 1, with probability \(\theta\) of generating the same output again.
stan/markov-autocorelation.stan
data {
int<lower=0> M;
real<lower=0, upper=1> rho; // prob of staying in same state
}
generated quantities {
array[M] int<lower=0, upper=1> y; // Markov chain
y[1] = bernoulli_rng(0.5);
for (m in 2:M) {
y[m] = bernoulli_rng(y[m - 1] ? rho : 1 - rho);
}
}The assignment to y[m] in this program is equivalent to this longer form.
if (y[m - 1] == 1) {
y[m] = bernoulli_rng(rho);
} else {
y[m] = bernoulli_rng(1 - rho);
}The more concise form exploits two features of Stan. First, boolean expressions are coded as 1 (true) or 0 (false) in Stan, like in C++ and Python. Because y[m - 1] is constrained to take on values 0 or 1, we know that y[m - 1] == 1 is equivalent to y[m - 1]. Second, it uses the ternary operator, also like in C++. The expression cond ? e1 : e2 involves three arguments, separated by a question mark (?) and a colon (:); it’s value is the value of e1 if cond is true and the value of e2 otherwise. Unlike an ordinary function, the ternary operator only evaluates e1 if the condition is true and only evaluates e2 if the condition is false.
We can simulate these models in Python and print the first 100 values simulated for the Markov chain \(y\).
model = csp.CmdStanModel(
stan_file = '../stan/markov-autocorrelation.stan')
M = 100 if DRAFT else 1000
rhos = []
iterations = []
draws = []
estimates = []
for rho in [0.05, 0.5, 0.95]:
data = {'M': M, 'rho': rho}
sample = model.sample(data = data, seed = 123, chains = 1,
iter_warmup = 0, iter_sampling = 1,
show_progress = False, show_console = False)
y_sim = sample.stan_variable('y')
cum_sum = np.cumsum(y_sim)
its = np.arange(1, M + 1)
ests = cum_sum / its
draws.extend(y_sim[0])
iterations.extend(its)
estimates.extend(ests)
rhos.extend(itertools.repeat(str(rho), M))
df = pd.DataFrame({'draw': draws, 'iteration': iterations,
'estimate': estimates, 'rho': rhos})
rho05 = np.array(df.query('rho == "0.05"').head(100)['draw'], dtype = 'int')
rho50 = np.array(df.query('rho == "0.5"').head(100)['draw'], dtype = 'int')
rho95 = np.array(df.query('rho == "0.95"').head(100)['draw'], dtype = 'int')
print("Markov chain draw with probability rho of repeating last value:\n")
print("rho = 0.05:", rho05, "\n")
print("rho = 0.50:", rho50, "\n")
print("rho = 0.95:", rho95, "\n")Markov chain draw with probability rho of repeating last value:
rho = 0.05: [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0
1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1]
rho = 0.50: [0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0
1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0
0 1 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0]
rho = 0.95: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
With a 0.05 probability of staying in the same state, the Markov chain exhibits strong anti-correlation in its draws, which tend to bounce back and forth between 0 and 1 almost every iteration. In contrast, the 0.95 probability of staying in the same state means the draws have long sequences of 0s and 1s. The 0.5 probability produces independent draws from the Markov chain and we see short runs of 0s and 1s.
Next, we will show a running average of the 0 and 1 draws for 1000 iterations for the three chains.
mydraw(
pn.ggplot(df, pn.aes(x='iteration', y='estimate',
group='rho', color='rho'))
+ pn.geom_hline(yintercept = 0.5, color = 'black')
+ pn.geom_line()
+ pn.labs(x = "iteration", y = "estimate")
)The black horizontal line at 0.5 shows the true answer. It is clear that the anti-correlated chain (red) converges much more quickly to the true answer of 0.5 and more stably than the independent chain (green), which in turn converges much more quickly than the correlated chain (blue).
4 Hints for Python programmers
Python follows general programming language idioms like in C++ and Java, whereas Stan follows linear algebra idioms like in R and MATLAB.
4.1 Indexing from 1
Unlike Python, Stan uses the standard mathematical indexing for matrices, which is from 1. If I declare a vector[3], then the valid indexes are 1, 2, and 3. If v is a vector variable, then v[0] is an indexing error and will throw an exception and log a warning as it is caught and the resulting MCMC proposal is rejected.
4.2 Storage order
Unlike Python, Stan uses a column-major memory layout for matrices. This means it’s more efficient to iterate by column than by row through a matrix. To visit every element of a matrix, you want to loop as follows.
matrix[M, N] x; // M rows, N columns
for (n in 1:N) {
for (m in 1:M) {
... operate on x[m, n] ...
}
}Arrays in Stan are stored in row-major order, but they are not stored in a single array like in NumPy’s ndarray type. Arrays of primitives store the primitive values in memory-local order. Arrays of containers store pointers to the containers. Thus accessing array members can be non-local in memory, but will not require copying. In contrast, accessing a row or column of a matrix usually requires allocating memory and copying.
4.3 Inclusive ranges
Unlike Python, Stan uses inclusive ranges, so that 1:3 represents the sequence 1, 2, 3. The main disadvantage of inclusive notation is that the length of L:U is U - L + 1.
Putting these together, Python loops over a container with N elements look as follows.
v = np.random.rand(N)
for n in range(0, N):
... process v[n] ...The Python version visits elements v[0], v[1], …, v[N - 1].
Loops in Stan are as follows.
vector[N] v;
for (n in 1:N) {
... process v[n] ...
}The Stan version visits elements v[1], v[2], …, v[N].
4.4 Strong typing
Stan variables are strongly typed. This means every variable is assigned a type when it is declared and that type never changes. Only values of expressions of the same type of the variable may be assigned to it. Block variables are also sized and the size never changes after the declaration is executed.
There are three one dimensional (1D) real-valued container types (1D array, column vector, and row vector) and four two dimensional (2D) containers (2D array, 1D array of vectors, 1D array of column vectors, and matrix). There are functions to convert back and forth for common cases, but complicated cases must be handled manually. Loops that just do reassignment are fast in Stan as there is no overhead from derivative calculations; see the section on automatic differentiation for more details.
4.5 Block scope
Stan follows the C/C++ convention whereby variables declared within a block are local to that block. For example, in this Stan program, logit_theta is only a valid variable within the conditional block; once control has left the conditional block, logit_theta falls out of scope.
if (theta < 0.5) {
real logit_theta = logit(theta);
...
}
logit_theta = 1.9; // ILLEGAL: OUT OF SCOPEThe solution is to move up the declaration.
real logit_theta;
if (theta < 0.5) {
logit_theta = logit(theta);
...
}
logit_theta = 1.9; // OK4.6 Loops are not slow
Unlike in R and Python, loops are fast in Stan because Stan is a compiled language. For operations that only involve indexing and assignment, loops can be faster in Stan than their vectorized counterparts, because they avoid intermediate allocations.
However, this is only half of the story. Whenever functions are applied to parameters, the operation, its result, and its operands are recorded in an expression graph. For operations that involve functions other than indexing or reshaping operations applied to parameters, vectorized versions of functions will almost always reduce peak memory usage and increase speed.
4.7 Whitespace and semicolons
Stan follows C/C++ conventions on whitespace in which any sequence of whitespace characters (space, tab, new line) are interchangeable semantically. Python and R are both space-sensitive in different ways—Python uses space as a block delimiter and R has an eager line-based parser. In Stan, we use semicolons (;) to mark the end of an atomic statement. This means it is OK to continue expressions on the following line without any special syntactic marker, e.g.,
lp[k] = bernoulli_lpmf(k | theta)
+ normal_lpdf(y[n] | mu[k], sigma[k]);Both R and Python here would try to terminate the assignment to lp after the first line and leave a dangling expression for the second line. We recommend following mathematical conventions and breaking a line before an operator, ideally right before a term in a chain of additions or a factor in a chain of multiplications.
5 Stan example: Laplace’s live birth inverse problem
Given a forward model that generates data from parameters, the inverse problem is that of inferring parameter values from observed data.
5.1 Bayesian solutions to inverse problems
Bayes (1763) introduced the paradigm of statistical inference that has come to be known as Bayesian statistics.
Parametric statistics and variable types
Before introducing Bayes’s paradigm, we will try to settle on some terminology around variables. In Stan, we classify each variable as being a data variable or a parameter.
When talking about Bayesian statistics abstractly (i.e., without regard to a specific model form), we will use the variable \(y\) to represent all of the model’s data, which are values that are known or observed. Data my include
- constants: sizes, bounds,
- unmodeled data: covariates (aka features), constant parameters of priors, and
- modeled data: outcome measurements, individual or group-level covariates in generative models.
Similarly, we us the variable \(\theta\) for all of our model’s parameters, which are values that are not known and must be inferred. Parameters may include
- process parameters: parameters of the forward scientific process,
- measurement parameters: parameters of the measurement process,
- hyperpriors: hyperparameters for prior distributions,
- missing data: unknown data items, latent but potentially observable values, and
- predictive quantities: event probability indicators, predictions, forecasts, backcasts.
Constants are things like the sizes of matrices, the number of components in a mixture model, the bounds on a constrained variable, etc. Covariates are typically used as inputs (aka dependent variables) to a regression. For instance if I might want to predict someone’s presidential vote based on their income and state of residence. In this case, the income and state of residence are unmodeled covariates and the vote is modeled data if it is known.
In Stan, the number of parameters may depend on the data, but once the data \(y\) is fixed, the number of parameters is also fixed. This means we can implement some non-parametric models (e.g., Gaussian processes), but have to approximate others (e.g., Dirichlet processes).
The Bayesian process
Gelman et al. (2013) summarize the process of developing a Bayesian model for an applied problem as follows,
_Define a joint probability model over all of the observables and non-observables that is consistent with what we know about the underlying scientific process and underlying measurement process (which together form the data-generating process) and our prior knowledge.
Condition on observed data to infer the posterior distribution of any quantities of interest such as predictions or event probabilities.
Evaluate the model in terms of fit to data and resulting predictions, how sensitive the predictions are to model assumptions, and if necessary, go back to step (1) and refine the model.
Usually, the joint probability model here is defined in terms of a parametric sampling density \(p(y \mid \theta)\) which is a forward model describing how to generate observations \(y\) given parameters \(\theta\), and a prior density \(p(\theta)\) capturing what is known about the unknowns before observing the current data \(y\).
Stan is useful for expressing the model (1), performing posterior inference (2), and evaluating and comparing models (3).
Bayes’s formulation
Bayes (1763) formulated the inverse problem of determining a posterior distribution with density \(p(\theta \mid y)\) from a prior with density \(p(\theta)\) and a sampling distribution with density \(p(y \mid \theta)\). Bayes provided the following derivation, which has come to be known as Bayes’s rule. \[\begin{align} p(\theta \mid y) &= \frac{p(y, \theta)} {p(y)} & \textrm{[definition of conditional probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {p(y)} & \textrm{[chain rule]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y, \theta) \, \textrm{d}\theta} & \textrm{[law of total probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}. & \textrm{[chain rule]} \end{align}\]
The first steps reveal the key densities involved in Bayesian inference for data \(y\) and parameters \(\theta\), \[ \underbrace{p(\theta \mid y)}_{\textrm{posterior}} = \underbrace{p(y \mid \theta)}_{\textrm{likelihood}} \cdot \underbrace{p(\theta)}_{\textrm{prior}} \ / \underbrace{p(y)}_{\textrm{evidence}}. \] When considered as a function of the parameter \(\theta\) for fixed data \(y\), as it is in Bayes’s rule, the density \(p(y \mid \theta)\) is known as the likelihood.
The second step expresses the evidence in terms of the likelihood and prior. \[ p(y) = \int_{\Theta} p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta. \] In applications, this integral is usually intractable in the sense of there being no closed-form analytic solution in terms of elementary functions. For example, even a simple logistic regression presents an intractable integral for the evidence, with \[\begin{align} p(\theta) &= \textrm{normal}(\theta, 1), \\[2pt] p(y \mid \theta, x) &= \prod_{n=1}^N \textrm{bernouolli}(y_n \mid \textrm{logit}^{-1}(x_n \cdot \theta)) \end{align}\]
The good news is that most applications of Bayesian statistics do not require us to solve the integral presented in the denominator of Bayes’s rule, even numerically. One case where we need normalizing constants is to evaluate Bayes factors for model comparison, but like Gelman et al. (2013), we prefer other means of evaluating models like posterior predictive checks or cross-validation, both of which are beyond the scope of this getting-started tutorial.
Rather than computing the evidence, we work up to a proportion (i.e., a multiplicative constant), where we have \[ p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta), \] because \(p(y)\) is a constant in the sense that it does not depend on \(\theta\). This is what lets us express the log posterior as the sum of the log likelihood and the log prior in Stan, \[ \log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) + \textrm{const}, \] where the constant is the negative log evidence, \(- \log p(y)\).
5.2 Live birth data
A decade after Bayes published his rule, Pierre Simon Laplace (1774) used Euler’s beta function to derive the conjugate prior for binomially distributed data. A conjugate prior for a sampling distribution is a family of distributions such that if the prior is drawn from that family, the posterior will be, too. In this section, we will work through Laplace’s analysis in what has become the standard textbook introduction to Bayesian inference.
Laplace gathered data on the sexes of babies in live births in Paris between 1745 and 1770.
| sex | live births |
|---|---|
| female | 105,287 |
| male | 110,312 |
It sure looks like there were more boys than girls born.
5.3 Laplace’s model
With \(y\) male births out of \(N\) total births, Laplace adopted the sampling distribution \[ y \sim \textrm{binomial}(N, \theta), \] which requires a number of trials \(N \in \mathbb{N}\), chance of success \(\theta \in (0, 1)\), and number of successes \(y \in 0{:}N\).
Laplace used the prior \[ \theta \sim \textrm{beta}(1, 1), \] where \[ \textrm{beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} \cdot (1 - \theta)^{\beta - 1}. \] for \(\alpha, \beta \in (0, \infty)\), and \(\theta \in (0, 1)\). The distribution \(\textrm{beta}(1, 1)\) is uniform over probabilities \(\theta \in (0, 1)\) because the density is proportional to a constant, \[\begin{align} \textrm{beta}(\theta \mid 1, 1) &\propto \theta^{1 - 1} \cdot (1 - \theta)^{1 - 1} \\[4pt] &= 1. \end{align}\]
Analytic posterior
As an aside, Laplace’s model here is simple enough to solve for the posterior analytically.
\[\begin{align}
p(\theta \mid y, N)
&\propto p(y \mid N, \theta) \cdot p(\theta)
\\[4pt]
&= \textrm{binomial}(y \mid N, \theta)
\cdot \textrm{beta}(\theta \mid 1, 1)
\\[4pt]
&\propto \theta^y \cdot (1 - \theta)^{N - y} \cdot \theta^{1 - 1} \cdot (1
- \theta)^{1 - 1}
\\[4pt]
&= \theta^{y + 1} \cdot (1 - \theta)^{N - y + 1}
\\[4pt]
&\propto \textrm{beta}(\theta \mid y + 1, N - y + 1).
\end{align}\] Despite the intermediate proportional-to steps, the normalizing constant is unique, so we can conclude that \[
p(\theta \mid y, N) = \textrm{beta}(\theta \mid y + 1, N - y + 1).
\] This entire derivation went through without ever worrying about the normalizing constant for the beta distribution or the binomial distribution.
5.4 Stan program for birth data
Unlike the first Stan model we saw, which only generates data, the following Stan program requires data to be provided, specifically the the number of male births (\(y\)) and the total number of births (\(N\)). The model will then allow us to estimate the probability of a male birth (\(\theta\)) as well as the probability that boys are more likely to be born than girls (\(\theta > 0.5\)). The Stan program for Laplace’s model is as follows.
stan/sex-ratio.stan
data {
int<lower = 0> N; // number of live births
int<lower = 0, upper = N> y; // male births
}
parameters {
real<lower=0, upper=1> theta; // chance of boy
}
model {
theta ~ beta(1, 1); // prior
y ~ binomial(N, theta); // binomial sampling
}
generated quantities {
int<lower=0, upper=1> boys_gt_girls = theta > 0.5;
}5.5 Parameter and model blocks
In this Stan program, we see that both the number of total births \(N\) and the number of male births \(y\) are given as data. Then there are two additional blocks we did not see in our earlier program, a parameters block, which is used to declare unknown values (here, just the male birth rate \(\theta\)), and a model block, which is where we define our target log density up to an additive constant. The target log density is typically a Bayesian posterior expressed as a log prior and log sampling distribution. The parameters block declares the type of theta, which is a real value constrained to fall in the interval \([0, 1]\). The model block defines the prior, which we take to be uniform over the possible values for theta. The model block also defines the sampling distribution, which codes the fact that the observed data y was generated from a binomial distribution with N trials and theta probability of a male birth. Finally, we have a generated quantities block that defines a single binary indicator variable, boys_gt_girls. This variable will take the value 1 if the probability of a boy is greater than the probability of a girl.
5.6 Sampling from the posterior
When we run a Stan program, Stan generates a sequence of \(M\) random draws, which are approximately identically distributed according to the posterior, \[ \theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid y). \] If we were to take \(M \rightarrow \infty\), the draws will converge to being identically drawn from the posterior. In simple or well behaved models, coupling arguments show that they converge to true draws from the posterior in hundreds or thousands (or sometimes more) draws (Jacob, O’Leary, and Atchadé 2020); they become numerically indistinguishable from true draws well before that time.
Stan’s sampler
Stan uses a Markov chain Monte Carlo (MCMC) algorithm, which can lead to autocorrelation in the random draws from the posterior. That is, the draws are not typically independent, with each draw being correlated (or anti-correlated) with the previous draw. This autocorrelation does not introduce bias into the Monte Carlo estimates.
In order to scale to high dimensional problems, we need to exploit gradients and/or Hessians. Duane et al. (1987) introduced the gradient-based Hamiltonian Monte Carlo (HMC) algorithm; the overview by Neal (2011) discusses theoretical and practical scaling results and Livingstone et al. (2019) provide geometric ergodicity results.
For Stan, Hoffman and Gelman (2014) developed an adaptive form of Hamiltonian Monte Carlo (HMC) known as the no-U-turn sampler (NUTS), and Betancourt (2017a) improved it with an adaptive preconditioner and multinomial sampling over trajectories. NUTS can be hyper-efficient in the sense of generating anti-correlated draws that can lead to more efficient Monte Carlo estimates than independent draws.
Fitting in Stan
We fit Laplace’s model by compiling the model, constructing a dictionary for the data, and then calling the sample method on the compiled model with the dictionary. We call the sample method with 1,000 warmup iterations and 10,000 sampling iterations; we are taking so many draws in order to draw show smooth plots later. The Stan programs considered in this introduction are all quite simple and inexpensive to run for many iterations. We consider how many draws are required for statistical inference in the section on practical guidelines.
model = csp.CmdStanModel(stan_file = '../stan/sex-ratio.stan')
boys = 110312
girls = 105287
data = {'N': boys + girls, 'y': boys}
M = 100 if DRAFT else 10_000
sample = model.sample(data = data, seed = 123,
iter_sampling = M, iter_warmup = 1000,
show_progress = False, show_console = False)As before, we proceed by first extracting the draws for the variables theta and boys_gt_girls.
theta_draws = sample.stan_variable('theta')
boys_gt_girls_draws = sample.stan_variable('boys_gt_girls')We can plot a histogram of approximate draws \(\theta^{(m)} \sim p(\theta \mid N, y)\) from the posterior to give us a sense of the value of \(\theta\) and its uncertainty given our observed data \(y\).
mydraw(
pn.ggplot(pd.DataFrame({'theta': theta_draws}),
pn.aes(x = 'theta')) +
pn.geom_histogram(color='white') +
pn.labs(x = 'θ') +
pn.theme(axis_text_y = pn.element_blank(),
axis_title_y = pn.element_blank(),
axis_ticks_major_y = pn.element_blank())
)All of the draws have a value for \(\theta\) between 0.50 and 0.52. In the next sections, we will see how to use these draws to estimate a single value for \(\theta\) as well as to compute probabilities, such as the probability that \(\theta > 0.5\) or \(\theta > 0.51\).
5.7 Bayesian point estimates
In Bayesian terms, a point estimate for a parameter \(\Theta\) conditioned on some observed data \(Y = y\) is a single value \(\hat{\theta} \in \mathbb{R}^D\) that in some way summarizes the posterior \(p(\theta \mid y)\). Formally, Bayesian estimators are defined in such a way as to minimize a loss function.
Posterior mean estimator
The most common Bayesian point estimate for a parameter is the posterior mean,
\[\begin{align} \widehat{\theta} &= \mathbb{E}[\Theta \mid Y = y] \\[6pt] &= \int_{\Theta} \theta \cdot p(\theta \mid y) \, \textrm{d}\theta \\[6pt] &= \lim_{M \rightarrow \infty} \, \frac{1}{M} \sum_{m=1}^M \theta^{(m)} \\[6pt] &\approx \frac{1}{M} \sum_{m=1}^M \theta^{(m)}, \end{align}\] where in the last two lines, each draw is distributed approximately according to the posterior, \[ \theta^{(m)} \sim p(\theta \mid y). \]
We have snuck in conditional expectation notation in the first line of this definition. Expectations are just weighted averages, with weights given by a probability density. Bayesian inference involves expectations over the posterior, the concise notation for which is conditional expectation notation, \[ \mathbb{E}\! \left[ f(\Theta) \mid Y = y \right] \ = \ \int_{\Theta} f(\theta) \cdot p_{\Theta \mid Y}(\theta \mid y) \, \textrm d\theta, \] where \(\Theta\) and \(Y\) are random variables with realizations \(\theta \in \Theta\) and \(y \in Y\) (we write \(y \in Y\) for a random variable \(Y\) if \(y\) is in support of \(Y\)’s density, i.e., \(p_Y(y) > 0\)).
For Laplace’s model, the estimate for the male birth rate \(\theta\) conditioned on the birth data \(y\) is calculated as the sample mean for the extracted draws for theta.
theta_hat = np.mean(theta_draws)
print(f"estimated theta = {theta_hat:.3f}")estimated theta = 0.512
Posterior median estimator, quantiles, and intervals
A popular alternative Bayesian point estimate is the posterior median, \(\theta^+\). The median is such that for each dimension \(d \in 1{:}D\), \[ \Pr[\Theta_d \leq \theta^+_d] = \frac{1}{2}. \]
The posterior median can be calculated by taking the posterior median of the draws, as follows.
theta_plus = np.median(theta_draws)
print(f"estimated (median) theta = {theta_plus:.3f}")estimated (median) theta = 0.512
Because our posterior distribution is nearly symmetric with Laplace’s data, the posterior mean and posterior median are very close.
Quantiles
Other posterior quantiles are estimated the same way. For example, if we want the posterior 95% quantile, we just take the empirical 95% point in the sorted chain of draws. For example, here are the 5% quantile and 95% quantile for Laplace’s posterior, calculated with empirical quantiles.
quantile_05 = np.quantile(theta_draws, 0.025)
quantile_95 = np.quantile(theta_draws, 0.975)
print(f"""0.05 quantile = {quantile_05:.3f};
0.95 quantile = {quantile_95:.3f}""")0.05 quantile = 0.510;
0.95 quantile = 0.514
Posterior intervals
Together, the 5% quantile and 95% quantile give us the bounds of our 90% central probability interval, which is defined as the interval containing 90% of the posterior probability mass such that half of the remaining mass (5%) is below the interval and half (5%) is above.
print(f"central 90% posterior interval for theta")
print(f" = ({quantile_05:.3f}, {quantile_95:.3f})")central 90% posterior interval for theta
= (0.510, 0.514)
Other intervals are computed in the exact same way.
Estimation error and bias
The error of an estimate is its difference from the true value, \[ \textrm{err} = \hat{\theta} - \theta. \] Our estimate \(\hat{\theta}\) is implicitly a function of the data \(y\) and so is \(\textrm{err}\), so we can make this explicit and write \[ \textrm{err}(y) = \hat{\theta}(y) - \theta. \]
The bias of an estimator is defined as its expected error (as averaged over the data distribution for the random variable \(Y\)), \[\begin{align} \textrm{bias} &= \mathbb{E}[\textrm{err}(Y)] \\[6pt] &= \mathbb{E}[\hat{\theta}(Y) - \theta] \\[6pt] &= \int_Y \hat{\theta}(y) - \theta \ \textrm{d}y. \end{align}\]
Properties of estimators
The posterior mean is a popular Bayesian estimator for two reasons. First, it is an unbiased estimator in the sense of having zero bias. Second, it has the minimum expected square error among unbiased estimators, where the squared error of an estimate is defined by \[ \textrm{err}^2(y) = \left(\hat{\theta}(y) - \theta\right)^2. \] Posterior means might not exist if the posterior has very wide tails, like the standard Cauchy distribution, \(\textrm{cauchy}(x \mid 0, 1) \propto \frac{1}{1 + x^2}\).
The posterior median \(\theta^+\) has three pleasant properties. First, it is always well defined, even for densities with poles or very wide tails. Second, it is more robust to outliers than squared error because errors are not amplified by squaring. Third, the posterior median estimator minimizes expected absolute error.
The posterior mode may not exist. For example, there is no posterior mode in standard hierarchical models. There is not even a mode for a simple density with a pole, such as \(\textrm{exponential}(x \mid 1) = \exp(-x)\) for \(x > 0\). The posterior mode is at least consistent in the sense of converging to the true value as the data size grows in cases where it exists.
We will concentrate on posterior means in this quick introduction to Bayes and Stan.
(Markov chain) Monte Carlo error and effective sample size
The Markov chain we use to sample is itself a random variable. Re-running the sampler will produce slightly different results due to Monte Carlo error (the error introduced by using only a finite sample of \(M\) draws).
Stan reports Markov chain Monte Carlo standard error along with its estimates of the mean. The MCMC standard error is for a scalar parameter \(\theta_d\) and is defined as \[ \textrm{mcmc-se} = \frac{\textrm{sd}[\Theta_d \mid Y = y]}{N^{\textrm{eff}}}, \] where the numerator is the standard deviation of the parameter \(\theta_d\) in the posterior and \(N^{\textrm{eff}}\) is the effective sample size. The posterior variance and standard deviation are defined as follows.
\[\begin{align} \textrm{var}\left[\Theta_d \mid Y = y\right] &= \mathbb{E}\left[\left(\Theta_d - \mathbb{E}\left[\Theta_d \mid Y = y\right]\right)^2 \ \big| \ Y = y\right] \\[6pt] \textrm{sd}\left[\Theta_d \mid Y = y\right] &= \sqrt{\textrm{var}[\Theta_d \mid Y = y]}. \end{align}\]
In the usual central limit theorem, the sample size (number of independent draws) appers in place of \(N^{\textrm{eff}}\). The effective sample size for a sample of size \(M\) is defined to be \[ N^{\textrm{eff}} = \frac{M}{\textrm{IAT}}, \] where \(\textrm{IAT}\) is the integrated autocorrelation time. We are not going to define it formally, but it may be thought as the interval between effectively independent draws in our Markov chain. If we have low autocorrelation, \(\textrm{IAT}\) will be close to 1 and if the autocorrelation is higher, it can be much higher. If the \(\textrm{IAT}\) is much higher than 100, it can become difficult to estimate. If the autocorrelation is negative, the \(\textrm{IAT}\) is less than 1 and the effective sample size is larger than the number of draws. Thus \(N^{\textrm{eff}}\) is the number of independent draws that would lead the same error as our correlation draws using a Markov chain.
5.8 Estimating event probabilities
Laplace wasn’t looking for a point estimate for \(\theta\). He wanted to know the probability that \(\theta > \frac{1}{2}\) after observing \(y\) male births in \(N\) trials. In the notation of probability theory, he wanted to estimate an event probability.
A subset of parameters is known as an event. We can convert conditions on parameters into events. For example, the condition \(\theta > \frac{1}{2}\) can be turned into the event \[ A = \left\{ \theta \in \Theta : \theta > \frac{1}{2} \right\}. \] Events are what are assigned probabilities by a measure in probability theory. Given a probability measure, the probability of the event \(A\), that the rate of boy births is higher than girl births, will be well defined. Because we can convert conditions to events, we will be sloppy and treat the conditions as if they were events. This allows us to write \(\Pr\!\left[\Theta > \frac{1}{2} \, \big| \, N, y\right]\) for the probability of the event \(\Theta > \frac{1}{2}\).
Event probabilities via indicators
The indicator function \(\textrm{I}\) maps propositions into the value 1 if they are true and 0 if they are false. For example, \(\textrm{I}(\theta > \frac{1}{2}) = 1\) if the proposition \(\theta > \frac{1}{2}\) is true, i.e., when \(\theta\) is greater than one half.
Event probabilities are defined as posterior conditional expectations of indicator functions for events. \[\begin{align} \Pr[\Theta > 0.5 \mid N, y] &= \mathbb{E}\!\left[\textrm{I}[\Theta > 0.5] \,\big|\, N, y\right] \\[8pt] &= \int_{\Theta} \textrm{I}(\theta > 0.5) \cdot p(\theta \mid N, y) \, \textrm{d}\theta \\[8pt] &\approx \frac{1}{M} \sum_{m=1}^M \textrm{I}(\theta^{(m)} > 0.5), \end{align}\] where we assume \(\theta^{(m)} \sim p(\theta \mid N, y)\) is distributed according to the posterior for \(m \in 1{:}M\). Following physics conventions, we use square brackets for functors (functions that apply to functions); that means we write \(\textrm{I}[\cdot]\) when we apply the indicator function to a random variable and \(\textrm{I}(\cdot)\) when we apply it to a primitive scalar.
Events as indicators in Stan
In Stan, we code the value of the indicator function directly and assign it to a variable in the generated quantities block.
generated quantities {
int<lower=0, upper=1> boys_gt_girls = theta > 0.5;
} Conditional expressions like theta > 0.5 take on the value 1 if they are true and 0 if they are false. In mathematical notation, we would write \(\textrm{I}(\theta > 0.5)\), which takes on value 1 if \(\theta > 0.5\) and 0 otherwise; in Stan, like in C++, we treat > as a binary operator that returns either 0 or 1, so we just write theta > 0.5 in Stan.
The answer to Laplace’s question
The posterior mean of the variable boys_gt_girls is thus our estimate for \(\Pr[\theta > 0.5 \mid N, y]\). It is essentially 1. Printing to 15 decimal places, we see
Pr_boy_gt_girl = np.mean(boys_gt_girls_draws)
print(f"estimated Pr[boy more likely] = {Pr_boy_gt_girl:.15f}")estimated Pr[boy more likely] = 1.000000000000000
The value of 1 returned as an estimate brings up the important problem of numerical precision. As we can see from the histogram, all of our sampled values for \(\theta\) are greater than \(\frac{1}{2}\).
Laplace calculated the result analytically, which is \[ \Pr\!\left[\Theta > \frac{1}{2} \ \bigg| \ N, y\right] \approx 1 - 10^{-27}. \] Thus we would need an astronomical number of posterior draws before we would generate a value of \(\theta\) less than \(\frac{1}{2}\). As given, the answer of 1.0 is very close to the true answer and well within our expected Monte Carlo error. As an aside, using the standard double-precision (8 byte, 64 bit) floating point representation for real numbers, the number \(1 - 10^{-27}\) will round to \(1\) because machine precision, which is defined as the largest \(\epsilon\) such that \(1 - \epsilon \neq 1\), is about \(10^{-16}\). Let’s try it in Python, which, like Stan, uses double-precision arithmetic by default.
print(f"{(1.0 == (1.0 - 1e-27)) = }")(1.0 == (1.0 - 1e-27)) = True
which one would expect given \(10^{-27}\) is below the machine precision,
print(f"Machine precision: {np.finfo(float).eps}")Machine precision: 2.220446049250313e-16
MCMC summary statistics from Stan
With Stan, we can print a summary for the variable \(\theta\) in the posterior, which reports all of these values. We just call The .summary() function on the sample.
sample.summary(sig_figs = 3)| Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -149000.000 | 0.004790 | 6.660000e-01 | -149000.00 | -149000.000 | -149000.000 | 19300.0 | 38700.0 | 1.0 |
| theta | 0.512 | 0.000009 | 1.090000e-03 | 0.51 | 0.512 | 0.513 | 15300.0 | 30700.0 | 1.0 |
| boys_gt_girls | 1.000 | NaN | 9.380000e-14 | 1.00 | 1.000 | 1.000 | NaN | NaN | NaN |
The rows are for our random variables, with lp__ indicating the unnormalized log density defined by the Stan program. The unnormalized log density includes any change-of-variables adjustments required for transforming constrained parameters; the transforms for constrained parameters and associated change-of-variables adjustments are detailed in (Stan Development Team 2023b).
The other two rows are for variables defined in the Stan program, theta, and boys_gt_girls. The number of significant figures used in the results can be controlled in the summary function, as can the quantiles being reported.
The first column reports the posterior mean, and agrees with our earlier calculations for both variables. The second column is the Monte Carlo standard error (based on an estimated effective sample size) and a posterior standard deviation estimate. The next three columns are quantiles we computed earlier, and they also agree with our calculations. Next is the effective sample size, which can vary from variable to variable, and the effective sample size rate (per second). The final column reports \(\widehat{R}\), which we discuss in the next section.
6 Warmup and convergence monitoring
When running Markov chains, we want to make sure that we have moved far enough that our draws are approximately from the posterior. A standard way to monitor convergence is to start multiple Markov chains at different initializations (ideally chosen from a diffuse initialization like a draw from the prior) and measure whether they are producing draws from the same distribution.
6.1 Warmup
During its initial warmup iterations, Stan tries to find the high probability mass region in which it should be sampling, adapt a good step size, and estimate posterior variance. The variance is used to precondition the sampler through scaling; see (Neal 2011) for details. Stan can also estimate a dense covariance matrix and precondition with rotation and scaling (i.e., with a Euclidean metric).
Warmup converges when the step size and posterior covariance estimates no longer change. With multiple chains, it’s possible to test that they have all converged to roughly the same step size and covariance estimate. Unless things are going wrong, we typically don’t bother measuring whether adaptation has converged and instead measure our end goal directly, which is whether we are getting reasonable posterior draws after warmup.
Warmup doesn’t form a single coherent Markov chain because it uses memory to adapt. Once Stan starts sampling, the result is a Markov chain. All of our posterior analysis will be with draws from the Markov chain, not from warmup. We can save and extract the warmup draws to investigate the behavior of warmup.
6.2 Potential scale reduction and \(\widehat{R}\)
Stan uses the potential scale reduction statistic \(\widehat{R}\) (pronounced “R hat”). Given a sequence of Markov chains, Stan splits each of them in half to make sure the first half and second half of the chain agree, then calculates variances within each chain and across all chains and compares. The statistic \(\widehat{R}\) converges to 1 as the Markov chains converge to the same distributions.
6.3 How many chains for how long?
A simple rule of thumb is to run four chains until \(\hat{R} \leq 1.01\) and effective sample size (ESS) is greater than 100. The reason we recommend an effective sample size of “only” 100 is that it means the standard error will be \(\frac{1}{10}\) the size of the standard deviation. Given that posterior standard deviation represents residual uncertainty, calculating means to higher precision is rarely worthwhile.
The easiest way to achieve \(\widehat{R} \leq 1.01\) and \(N^{\textrm{eff}} > 100\) is to start with 100 warmup iterations and 100 sampling iterations. If there are \(\widehat{R}\) values that are too large or there are effective sample size values that are too low, then double the number of sampling and warmup iterations, and try again. Running more warmup iterations is important because sampling will not be efficient if warmup has not converged. At most, using the same number of warmup and sampling iterations costs a factor of two over the optimal settings, which are not known in advance.
Even if we use more than four chains, we need to make sure that our effective sample size is at least 25 per chain. It is not that we need so many draws for inference, but that we do not trust our effective sample size estimator if it is much lower. One way to check that the ESS estimator is OK is to double the number of draws and make sure that the ESS also doubles. If it doesn’t, it’s a sign that the first ESS estimate is unreliable.
6.4 Running chains concurrently
You can set the number of chains that are run using the chains argument of the sample() method and you can set the maximum number of chains to execute concurrently using parallel_cores (which defaults to 1, sequential execution). If you set the maximum number of parallel chains to be too low, CPU resources are potentially unused. If you set the number too high, then either CPU or memory can bottleneck performance and cause it to be slower overall than running with fewer chains. The only advice I can give here is to experiment. In personal projects on our own hardware, the goal is usually the largest effective sample size in the minimum amount of time. On the other hand, I sometimes find I need to leave enough processing power left over to continue to work on documents, email, etc. In a server setting, memory usage, latency, throughput, and I/O need to be balanced even more carefully between throughput and resource usage.
7 Stan example: A/B testing
A common application of statistics is to compare two things, such as the effectiveness of a new drug versus the current drug used to treat a condition. Another application might compare the effectiveness of two different advertisement presentations in getting users to click through. This is usually called A/B testing in a nod to comparing a hypothetical option A and option B.
Let’s consider three good Mexican restaurants in New York City, Downtown Bakery II in the East Village, Taqueria Gramercy in Gramercy, and La Delicias Mexicanas in Spanish Harlem. Here’s the number of reviews and 5-star reviews for these restaurants on the web site Yelp.
| name | 5-star reviews | total reviews |
|---|---|---|
| Downtown Bakery II | 141 | 276 |
| Taqueria Gramercy | 84 | 143 |
| La Delicias Mexicanas | 41 | 87 |
We can estimate a few things. First, we can estimate the probability that each restaurant really is a 5-star restaurant. We will parameterize this directly with a rate of 5-star reviews parameter for each restaurant. Then we can rank the restaurants based on their probability of being a 5-star restaurant. What does it mean to “be a 5-star restaurant” in this sense? It means getting 5-star reviews from Yelp reviewers. Our model is going to treat the reviewers as exchangeable in the sense that we don’t know anything to distinguish them from one another. Now whether this notion of 5-star restaurant is useful will depend on how similar the reader is to the population of reviewers.
We will assume that the number of 5-star reviews for a restaurant \(k \in 1{:}K\) depends on its underlying quality \(\theta_k \in [0, 1]\), \[ n_k \sim \textrm{binomial}(N_k, \theta). \] Here \(N_k \in \mathbb{N}\) is the number of reviews for restaurant \(k\) and \(n_k \in 0{:}N_k\) the number of 5-star reviews. We further assume the probabilities of 5-star reviews have a beta distribution, \[ \theta_k \sim \textrm{beta}(\alpha, \beta). \] In a beta distribution, the sum \(\alpha + \beta\) determines how much to regularize estimates toward \(\alpha / (\alpha + \beta)\), with \(\alpha = \beta = 1\) providing a uniform distribution.
For inference, we will be interested in the posterior distribution \(p(\theta \mid N, n)\) of 5-star review probabilities. This gives us the posterior density of the restaurants’ probability of receiving a 5-star review. With this posterior, we can rank restaurants based on their probability of receiving a 5-star review and calculate the probability that each is the best restaurant, \[ \Pr[\Theta_k = \max(\Theta) \mid n, N]. \]
7.1 Stan model for A/B testing
We will assume there are a total of \(K\) items being compared. In traditional A/B testing, \(K = 2\), but our example uses \(K = 3\) and our code works for any \(K\). We define the indicator array for our event probability estimates in the generated quantities block.
stan/ab-test.stan
data {
int<lower=0> K;
array[K] int<lower=0> trials;
array[K] int<lower=0> successes;
real<lower=0> alpha;
real<lower=0> beta;
}
parameters {
array[K] real<lower=0, upper=1> theta;
}
model {
successes ~ binomial(trials, theta);
theta ~ beta(alpha, beta);
}
generated quantities {
array[K] int<lower=0, upper=1> is_best;
for (k in 1:K) {
is_best[k] = theta[k] == max(theta);
}
}We have coded the data for \(K\) items directly in terms of number of trials and number of successes. We have also supplied the prior for the probabilities of success as data as alpha and beta. The model is the same binomial as we had before, except now the likelihood and priors are vectorized. In general, Stan is able to take something like the binomial distribution, which has an integer number of trials, integer number of successes, and a scalar success probability and take vectors for all of these. What we have written above is identical to what we would get with a loop,
for (k in 1:K) {
successes[k] ~ binomial(trials[k], theta[k])
}The vectorization of theta is different in that only theta is an array, whereas alpha and beta are scalars. The sampling statement for theta is equivalent to
for (k in 1:K) {
theta[k] ~ beta(alpha, beta);
}Because alpha and beta are scalars, they are not indexed. Rather, they are broadcast, meaning that the same alpha and beta is reused for each dimension of theta.
So now let’s call and fit this model and print the summary. We are setting \(\alpha = \beta = 2\), which is equivalent to setting them equal to 1 and adding 2 trials and 1 success to the data for each restaurant.
M = 100 if DRAFT else 1000
model = csp.CmdStanModel(stan_file = '../stan/ab-test.stan')
data = {'K': 3, 'trials': [276, 143, 87],
'successes': [141, 84, 41],
'alpha': 2, 'beta': 2}
sample = model.sample(data = data, seed = 123,
iter_warmup = M, iter_sampling = M,
show_progress = False, show_console = False)
sample.summary(sig_figs = 2)| Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -360.000 | 0.02600 | 1.200 | -360.00 | -360.00 | -360.00 | 2100.0 | 28000.0 | 1.0 |
| theta[1] | 0.510 | 0.00048 | 0.030 | 0.46 | 0.51 | 0.56 | 3813.0 | 50172.0 | 1.0 |
| theta[2] | 0.590 | 0.00062 | 0.040 | 0.52 | 0.59 | 0.65 | 4184.0 | 55059.0 | 1.0 |
| theta[3] | 0.470 | 0.00071 | 0.051 | 0.39 | 0.47 | 0.56 | 5185.0 | 68222.0 | 1.0 |
| is_best[1] | 0.065 | 0.00440 | 0.250 | 0.00 | 0.00 | 1.00 | 3074.0 | 40448.0 | 1.0 |
| is_best[2] | 0.900 | 0.00540 | 0.300 | 0.00 | 1.00 | 1.00 | 3035.0 | 39937.0 | 1.0 |
| is_best[3] | 0.034 | 0.00310 | 0.180 | 0.00 | 0.00 | 0.00 | 3393.0 | 44644.0 | 1.0 |
First, we make sure that our \(\widehat{R}\) values are less than 1.01 and that our \(N^{\textrm eff}\) are all greater than 25 per chain (4 chains here with default settings). In fact, we can see that our sampling is nearly as efficient as if we had independent posterior draws. The probability of a 5-star review is our value for theta, which are 51% for Downtown Bakery, 59% for Taqueria Gramercy, and 47% for La Delicias. Even so, we see that the probability that Taqueria Gramercy is the most likely to have their next review be a 5-star review, it’s still only 90% that’s the case. This is because binomial data is weak and we only have observations in the hundreds. (Editor: Don’t listen to Yelp—La Delicias is the best of these restaurants by far.)
8 Stan’s execution model
Stan programs consist of several blocks. Here is a summary of the blocks, when they are executed, and what they do. None of the blocks are required, but the blocks that do appear must be in the order listed here.
| block | executed | behavior |
|---|---|---|
functions |
as needed | user-defined function definitions |
data |
once | read data to construct model |
transformed data |
once | define transformed data |
parameters |
once / log density | constrain parameters (w. Jacobian) |
transformed parameters |
once / log density | define transformed parameters |
model |
once / log density | evaluate model log density |
generated quantities |
once / draw | define generated quantities |
8.1 Data and transformed data
The data block contains only variable declarations. The variables declared in the data block are read once at data load time.
The transformed data block contains variable declarations and definitions. It is executed once, after the data is read is in, to define the transformed data variables. It defines functions of the data, such as standardized predictors, constants to use for priors, etc. The transformed data block may use pseudorandom number generation.
All variable declarations at the block level must declare types and sizes (which may be data dependent). Local variables within blocks are declared without sizes.
Constraints in the data block are evaluated at the end of the block’s execution. Any violations throw exceptions which terminate execution.
Transformed data variables may be assigned in the transformed data block, but after these blocks have executed, the variables may not be reassigned.
8.2 Parameters and transformed parameters
The parameters block declare the variables over which the model is defined. It contains only variable declarations with sizes. When executed, it is supplied with concrete parameter values, which it transforms to the unconstrained scale based on the declared constraints (if any).
Constraints on parameters are used to define a transform of the constrained space to \(\mathbb{R}^D\). For example, a variable declared with a lower=0 constraint is log-transformed to an unconstrained variable, whereas a variable declared as a covariance matrix is Cholesky factored and then a log transform is applied to the diagonal elements to render an \(N \times N\) matrix as a vector in \(\mathbb{R}^{\binom{N}{2}}\). It is critical that parameters are declared with all necessary constraints so that the model has support (i.e., non-zero density, finite log density) over the entire transformed space.
The transformed parameters block defines functions of parameters and data. Any constraints declared on transformed parameters are validated at the end of the block’s execution. If the constraints are violated, an exception will be thrown, which typically causes the current proposal to be rejected.
Variables declared in the parameters block are like function argument declarations. The log density function defined by a Stan program takes the parameters as an argument. Thus the parameter values are always supplied externally to a Stan program.
After the transformed parameters block has executed, variables declared in it may not be reassigned.
The only difference between variables declared as local variables in the model block and those declared in the transformed parameters block is that the transformed parameters are printed and also available in the generated quantities block.
8.3 Model
The model block defines the log density and is evaluated once per log density evaluation. The log density starts with any components added by the log Jacobian determinants of the inverse transforms of the parameters from the unconstrained to the constrained space. The model is defined using constants or modeled data from the data or transformed data block, and parameters or transformed parameters.
The target log density that will be returned may be incremented directly, e.g.,
target += -0.5 * x^2;The current value of the target is available as target(); printing it can be useful for debugging.
Sampling statements are just syntactic sugar for incrementing the target log density. For example, the sampling statement
x ~ normal(0, 1);is equivalent to the target increment statement
target += normal_lupdf(x | 0, 1);The vertical bar is used to separate outcome variates from the parameters. The notation lpdf denotes a log probability density function, with lpmf for log probability mass functions. The variants lupdf and lupmf are for their unnormalized counterparts, which may drop normalizing constants that do not depend on the parameters. Unless the normalizing constants are needed, for example in a mixture model component, it is more efficient to use the lupdf and lupmf forms either explicitly or implicitly through sampling statements.
8.4 Generated quantities
The generated quantities block is evaluated once per draw rather than once per log density evaluation. With algorithms like Hamiltonian Monte Carlo, each draw may require a few, dozens, or even hundreds of log density evaluations. The further advantage of generated quantities is that they are evaluated with double-precision floating point values because they do not require differentiation, which is a factor of four or more faster than autodiff. The generated quantities block may also use pseudo-random number generation. Any constraints are evaluated at the end, but exceptions do not cause rejections, just potential warnings and potentially undefined (not-a-number) values.
Generated quantities do not contribute to the definition of the log density being sampled, but they are nevertheless part of the statistical model. The typical role of generated quantities is for posterior predictive inference, which is conditionally independent of the observed data given the model parameters.
8.5 User-defined functions
Users can define functions in the functions block. These can be applied anywhere in a Stan program, just like built-in functions. In addition to ordinary mathematical functions, users can define several types of special-purpose functions with Stan-specific behavior.
User-defined probability density and mass functions
Defining a function with the suffix _lpdf defines a continuous log probability density function where the first variable is a real- or complex-valued variate (these may be containers like matrices).
Similarly, the suffix _lpmf marks a discrete log probability density function and Stan will validate that the first argument is declared as an integer.
These functions are called using standard vertical bar notation to separate the variate and its parameters and they may also be used in sampling statements. For example, consider the following custom log probability density function.
functions {
real foo_lpdf(real y, real theta) {
return normal_lpdf(y | 0, exp(-theta))
}
}With this definition, the model block may use the function to increment the target log density,
target += foo_lpdf(z | phi);User-defined functions ending in _lpdf may also be used directly in sampling statements after removing the suffix.
z ~ foo(phi);A user-defined _lpdf function implicitly defines an equivalent _lupdf function so that user-defined _lpdf functions may be used with sampling notation. User-defined functions ending in _lupdf or _lupmf are not allowed.
User-defined random number generators
Only functions defined with _rng suffixes will be able to call other functions with _rng suffixes and they will only be able to be used in Stan programs in the transformed data and generated quantities blocks (i.e., the blocks that are not part of the model definition and hence do not need to be differentiated).
For example, a simple way to generate chi-squared random variates is to literally sum a sequence of squared normal variates (in a real program, one would use the the built-in chi-square _rng function).
real simple_chi_sq_rng(int n) {
real y = 0;
for (i in 1:n) {
y += normal_rng(0, 1)^2;
}
return y;
}Modifying the target
Functions that use the suffix _lp can access and modify the log density. This means they can only be used in transformed parameters or model blocks. For example, the following function implements what Stan does implicitly for a variable declared with a lower=0 constraint.
real pos_constrain_lp(real v) {
target += v; // change of variables adjust: log abs(d/dv exp(v))
return exp(v); // change of variables
}8.6 Automatic differentiation
Stan uses automatic differentiation to define gradients of the log density function. This is done by building up the complete expression graph of the log density being defined by the Stan model. A concrete value for unconstrained parameters is supplied, these are constrained and the log absolute determinant of the Jacobian of the unconstraining transformed is added to the expression graph. Then each operation involving parameters is added to the expression graph. The root is the final log density value, which is the value of target. The leaves are the unconstrained parameters. A reverse pass over this expression graph propagates the partial derivatives from the log density value down to the unconstrained parameters using the chain rule (it applies in the reverse order of the expression construction, which is a topological sort of the directed graph). See Carpenter et al. (2015) for details of Stan’s C++ architecture for automatic differentiation; current details are only in the developer documentation and code repository.
9 Stan example: Regression and prediction
In this section, we will go over simple regression models in Stan and see how to generate predictions for new items based on fitted parameters.
9.1 Fisher’s iris data set
We will analyze Fisher’s classic iris data set, which provides sepal and petal length and width (in centimeters) as well as the species of iris. First we read it in, then we will plot petal width versus petal length, with species indicated by color.
df = pd.read_csv('../stan/iris-data.csv')
mydraw(
pn.ggplot(df, pn.aes(x='petal_width', y='petal_length',
color='species')) +
pn.geom_point() +
pn.labs(x = "petal width (cm)", y = "petal length (cm)")
)The three species are of very different sizes, but there is a roughly linear relation between petal width and petal length across the different species.
The width values all have exactly one decimal place of accuracy, which means they were recorded to the nearest millimeter. There are models that deal with this kind of measurement quantization, but we will not consider them now and will instead proceed as if there were no rounding of our measurements.
9.2 Linear regression
A simple univariate linear regression models one measurement as a linear function of another measurement. In this example, we will model an iris flower’s petal length as a linear function of its petal width. If we let \(x_n\) be the petal width for flower \(n\) and let \(y_n\) be the petal length. A linear regression says that we expect \(y_n\) to be a linear function of \(x_n\), or roughly, the expected value of \(y_n\) will be \(\alpha + \beta \cdot x_n\) for some intercept \(\alpha\) and slope \(\beta\).
The linear relation in a linear regression only holds in expectation. That is, we expect the value of \(y_n\) to be \(\alpha + \beta \cdot x_n\), but in real observations there will be error due to either measurement or a failure of the linear relationship. In the Iris data plot, you can see that there are multiple irises with exactly the same measured petal width but varying petal lengths.
We can introduce a variable \(\epsilon_n\) for the difference between a petal’s length and its expected length given the linear relationship, \[ y_n = \alpha + \beta \cdot x_n + \epsilon_n. \] It is traditional to assume the error is normally distributed with a zero mean and scale \(\sigma > 0\), \[ \epsilon_n \sim \textrm{normal}(0, \sigma). \] We can rearrange terms and write this in a fashion that will mirror how it’s coded in Stan, \[ y_n \sim \textrm{normal}(\alpha + \beta \cdot x_n, \sigma). \] In this form, it’s clear that the linear prediction is the expectation for the random variable \(Y_n\) and the standard deviation is the scale, \[ \mathbb{E}[Y_n] = \alpha + \beta \cdot X_n \qquad \mathrm{sd}[Y_n] = \sigma. \]
Here is a simple Stan model to regress petal length on petal width; we replace \(x\) with petal_width and \(y\) with petal_length.
../stan/iris-petals.stan
data {
int<lower=0> N;
vector<lower=0>[N] petal_width;
vector<lower=0>[N] petal_length;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
petal_length ~ normal(alpha + beta * petal_width, sigma);
alpha ~ normal(0, 5);
beta ~ normal(0, 5);
sigma ~ lognormal(0, 1);
}The data block says there are \(N\) observations of petal width and length. The widths and lengths are declared to have type (column) vector, with a constraint lower=0 and size (number of rows) N. We are declaring these types as vectors because we are going to apply vector arithmetic to them.
The model parameters are coded in the parameter block. The lower bound for sigma is required here as the normal distribution is only defined for positive values of sigma. Stan requires every parameter value that satisfies the declared constraints to be in the support of the density, which is the set of values for which the density is non-zero (equivalently, the log density is finite).
The model codes the regression following the math, but without the subscripts. What’s going on is that the sampling statement for petal_length is vectorized. The variables petal_length and petal_length are both vectors of size N, whereas alpha, beta, and sigma are all scalars. Working outward, beta * petal_width is a scalar times a vector, which is defined in the usual way as a vector whose nth entry is beta * petal_width[n]. We then add alpha to that result (multiplication binds more tightly than addition and as in mathematical writing, we drop all unnecessary parentheses). This is defined by broadcasting so that alpha acts like a vector of size N and addition is defined in the usual way for vectors. As a result, alpha + beta * petal_width is a vector of size N, whose nth element is alpha + beta * petal_width[n]. Now we have a vector for the location parameter of a normal distribution, a scalar scale (sigma), and a vector outcome (petal_length). This works by broadcasting sigma to use for each n. The end result is equivalent to the following, but much more compact and efficient.
for (n in 1:N) {
petal_length[n] ~ normal(alpha + beta * petal_width[n], sigma);
}The rest of the model is priors for the regression coefficients and error scale. The coefficients are unconstrained, but the error scale is constrained to be positive, so we use a lognormal distribution (we could have also used a half normal and we should if scales near zero are possible).
We can then compile and fit the model and display the resulting fit for alpha and beta as a scatterplot.
def iris_data_frame():
return pd.read_csv('../stan/iris-data.csv')
def iris_data():
df = iris_data_frame()
N = df.shape[0]
petal_width = df['petal_width']
petal_length = df['petal_length']
species = 1 + pd.Series(df['species']).astype('category').cat.codes
num_species = 3
data = {'N': N,
'K': num_species,
'species': species,
'petal_width': petal_width,
'petal_length': petal_length,}
return data
M = 100 if DRAFT else 1000
model = csp.CmdStanModel(stan_file = '../stan/iris-petals.stan')
sample = model.sample(data = iris_data(), seed = 123,
iter_warmup = M, iter_sampling = M,
show_progress = False, show_console = False)
sample.summary(sig_figs = 2)| Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | 34.00 | 0.03400 | 1.200 | 31.00 | 34.00 | 35.00 | 1200.0 | 6500.0 | 1.0 |
| alpha | 1.10 | 0.00180 | 0.074 | 0.97 | 1.10 | 1.20 | 1700.0 | 8800.0 | 1.0 |
| beta | 2.20 | 0.00130 | 0.053 | 2.10 | 2.20 | 2.30 | 1700.0 | 9000.0 | 1.0 |
| sigma | 0.48 | 0.00071 | 0.028 | 0.44 | 0.48 | 0.53 | 1500.0 | 8000.0 | 1.0 |
This summary doesn’t give us a good feeling for the uncertainty in the linear relationship. To do that, we will plot multiple draws from the posterior along with the data.
alpha_draws = sample.stan_variable('alpha')
beta_draws = sample.stan_variable('beta')
plot = pn.ggplot(df, pn.aes(x='petal_width', y='petal_length',
color='species'))
plot = plot + pn.geom_point()
plot = plot + pn.labs(x = "petal width (cm)", y = "petal length (cm)")
for a, b in zip(alpha_draws[0:20], beta_draws[0:20]):
plot = plot + pn.geom_abline(intercept = a, slope = b,
alpha = 0.5, size = 0.2)
mydraw(
plot
)In order to fit all three groups of data, the plot doesn’t do such a great job of fitting any of them—the larger instances among iris setosa examples and the larger instances among the irs versicolor have underestimated petal lengths.
We didn’t include the error scale, but it’s roughly 0.5. That means it won’t be uncommon to get errors greater than 1 or less than -1. For iris setosa, this could easily result in predictions with negative lengths!
9.3 Built-in generalized linear model functions
Stan supplies built-ins for the most common generalized linear model (GLM) functions. These allow linear and non-linear regressions to be coded easily and efficiently. For example, our linear regression could be coded with the normal_id_glm function and a logistic regression coded with the bernoulli_logit_glm. The linear regression for the Iris data could be written as follows
petal_length ~ normal_id_glm(petal_width, alpha, beta, sigma);if the data and parameter type declarations are generalized to
matrix<lower=0>[N, 1] petal_width;
vector[N] petal_length;
real alpha;
vector[1] beta;
real<lower=0> sigma;The generalization of a vector of predictors to a matrix allows multivariate regressions with multiple predictors per item (e.g., using population density, speed limit, traffic density, and time of day to predict pedestrian traffic accidents).
9.4 Robust regression
Although we won’t consider this model in detail, Stan’s plug-and-play design makes it very simple to convert our linear regression into a robust regression by swapping out the normal error model for a Student-t error model,
petal_length ~ student_t(dof, alpha + beta * petal_width, sigma);We have just used dof for the degrees of freedom variable, but setting it at a value like 4 provides wider tails and setting it at 1 produces the very wide-tailed Cauchy distribution, which does not even have a finite mean or variance.
9.5 Posterior predictive inference
Now let’s say we have a new observation where we know the petal width and want to predict its length. Mathematically, to make a prediction for a new item, we use posterior predictive inference.
First, let’s consider evaluating the log density of a new petal’s length (\(\tilde{y}\)) given its observed width (\(\tilde{x}\)), having observed our original data \(x\) and \(y\). In Bayesian inference, we evaluate the posterior predictive distribution, where we let \(\Theta\) be our random parameters with realization \(\theta\), \[\begin{align} p(\tilde{y} \mid \tilde{x}, x, y) &= \mathbb{E}[p(\tilde{y} \mid \tilde{x}, \theta) \mid x, y] \\[6pt] &= \int_{\Theta} p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta \\[6pt] &\approx \frac{1}{M} \sum_{m=1}^M \, p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \end{align}\] where \(\theta^{(1)}, \ldots, \theta^{(m)} \sim p(\theta \mid x, y)\) are draws from the posterior. The parameters are marginalized out. It’s perhaps easiest to understand by considering the Monte Carlo approximation, which just averages the sampling log density over draws of parameters from the posterior.
Sampling uncertainty and estimation uncertainty
It can help to break the integral down into the two components of uncertainty, sampling uncertainty due to our sampling distribution and posterior uncertainty in the values of our parameters, \[ \int_{\Theta} \underbrace{p(\tilde{y} \mid \tilde{x}, \theta)}_{\textrm{sampling uncertainty}} \cdot \underbrace{p(\theta \mid x, y)}_{\textrm{estimation uncertainty}} \, \textrm{d}\theta \] In our case, \(\theta = \begin{bmatrix}\alpha & \beta & \sigma\end{bmatrix}\) and the sampling distribution is \[ p(\tilde{y} \mid \tilde{x}, \alpha, \beta, \sigma) = \textrm{normal}(\tilde{y} \mid \alpha + \beta \cdot \tilde{x}, \sigma). \] Even if we know the parameter values \(\alpha\), \(\beta\), and \(\sigma\) exactly, we can only predict \(\tilde{y}\) to within \(\epsilon\), where \(\epsilon \sim \textrm{normal}(0, \sigma)\). If we plug in a point estimate \(\widehat{\alpha}, \widehat{\beta}, \widehat{\sigma}\) for our parameters, we might get approximate inference that takes into account sampling uncertainty, but not estimation uncertainty, e.g., \[ p(\tilde{y} \mid \tilde{x}, x, y) \approx p(\tilde{y} \mid \tilde{x}, \widehat{\alpha}, \widehat{\beta}, \widehat{\sigma}). \]
So far, this only gives us a way to evaluate the log density of a resulting outcome \(\tilde{y}\) given a predictor \(\tilde{x}\). If we want to simulate possible \(\tilde{y}\), we have to make sure to add sampling uncertainty and draw \[ \tilde{y}^{(m)} \sim \textrm{normal}\!\left(\alpha^{(m)} + \beta^{(m)} \cdot \tilde{x} \mid \sigma^{(m)}\right), \]
where \(\alpha^{(m)}, \beta^{(m)}, \sigma^{(m)} \sim p(\alpha, \beta, \sigma \mid x, y)\) are posterior draws. In general, if we then want to estimate \(\tilde{y}\), we can take posterior means of these values. In the case here, our sampling distribution is symmetric, so that the expectation of the random draws and the expectation of their mean is identical. If our only goal is to estimate an expectation, it is better to average expectations than to average random draws with those expectations. This is formalized in the Rao-Blackwell theorem. Converting an estimator that uses a sample to one that uses expectations of that sample is called “Rao-Blackwellization.”
Standalone generated quantities
Let’s put all this together into a Stan program. First, let’s make posterior predictions for some new \(\tilde{y}\). We can do this using Stan’s standalone generated quantities feature that lets us run the generated quantities block of a new model given draws from another model. We could’ve also just included the generated quantities block in the original program.
iris-predict.stan
data {
int<lower=0> N_tilde;
vector<lower=0>[N_tilde] petal_width_tilde;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
generated quantities {
vector<lower=0>[N_tilde] E_petal_length_tilde
= alpha + beta * petal_width_tilde;
vector<lower=0>[N_tilde] petal_length_tilde
= to_vector(normal_rng(E_petal_length_tilde, sigma));
}This program declares two new data variables, N_tilde for the number of predicted items, and petal_width_tilde, a vector of petal widths of size N_tilde.
It is important that the program used for standalone generated quantities use exactly the same parameters as the original program. After our original fit, we read that sample back in for the parameters, then run generated quantities (and the transformed parameter block if there is one).
The generated quantities block first calculates the expected petal length given the petal width and assigns it to the variable E_petal_length_tilde. This code is vectorized so that it calculates all of the input petal widths at once. The second variable is then set by sampling according to the sampling distribution using the normal_rng function. This function is also vectorized, but it returns an array, so we convert it to a vector just to keep all the types the same.
data = {'N_tilde': 3,
'petal_width_tilde': [0.4, 1.75, 3.8]}
model = csp.CmdStanModel(
stan_file = '../stan/iris-posterior-predictive-sim.stan')
pps_sample = model.generate_quantities(data = data, seed = 123,
previous_fit = sample,
show_console = False)
print('Estimated petal lengths given petal widths')
for i in range(3):
length_draws = \
pps_sample.stan_variable('petal_length_tilde')[0:100, i]
E_length_draws = \
pps_sample.stan_variable('E_petal_length_tilde')[0:100, i]
print(f"petal_width_tilde[{i}] = {data['petal_width_tilde'][i]}")
print(f" mean(E_petal_length_tilde[{i}]) = {np.mean(E_length_draws):.2f}")
print(f" sd(E_petal_length_tilde[{i}]) = {np.std(E_length_draws):.2f}")
print(f" mean(petal_length_tilde[{i}]) = {np.mean(length_draws):.2f}")
print(f" sd(petal_length[{i}]) = {np.std(length_draws):.2f}\n")Estimated petal lengths given petal widths
petal_width_tilde[0] = 0.4
mean(E_petal_length_tilde[0]) = 1.98
sd(E_petal_length_tilde[0]) = 0.07
mean(petal_length_tilde[0]) = 1.88
sd(petal_length[0]) = 0.45
petal_width_tilde[1] = 1.75
mean(E_petal_length_tilde[1]) = 4.98
sd(E_petal_length_tilde[1]) = 0.06
mean(petal_length_tilde[1]) = 4.96
sd(petal_length[1]) = 0.50
petal_width_tilde[2] = 3.8
mean(E_petal_length_tilde[2]) = 9.55
sd(E_petal_length_tilde[2]) = 0.17
mean(petal_length_tilde[2]) = 9.52
sd(petal_length[2]) = 0.48
For each of our input petal widths \([0.4 \quad 1.75 \quad 3.8]\), we see the posterior mean prediction for petal width and its standard deviation calculated two ways. First, we take the posterior draws for the expected petal length, E_petal_length_tilde, which is just the linear prediction of petal value. The uncertainty comes only from estimation uncertainty in \(\alpha\) and \(\beta\). Second, we take the posterior draws for petal length, which include an additional normal error term with scale \(\sigma\). The means are roughly the same either way, but the standard deviation is much higher in the case of sampling. The second variable, petal_length_tilde, includes both sources of posterior uncertainty, estimation uncertainty and uncertainty from the sampling distribution.
9.6 Lognormal and transformed data
In this section, we will convert our iris model to the log scale using the transformed data block to convert the petal width to a log scale and use a lognormal regression on the width to predict the length.
The lognormal distribution is a simple transform of a normal distribution with support over positive values. If \[ U \sim \textrm{lognormal}(\mu, \sigma), \] then \[ log U \sim \textrm{normal}(\mu, \sigma). \] This will allow us to transform our Stan program. We first translate the petal width to the log scale in the transformed data block, then model length as a lognormal regression on log width in the model block.
../stan/iris-petals-log.stan
data {
int<lower=0> N;
vector<lower=0>[N] petal_width;
vector<lower=0>[N] petal_length;
}
transformed data {
vector[N] log_petal_width = log(petal_width);
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
petal_length ~ lognormal(alpha + beta * log_petal_width, sigma);
alpha ~ normal(0, 3);
beta ~ normal(0, 3);
sigma ~ lognormal(0, 0.6);
}We have introduced a new block for transformed data, in which we have defined a vector of log petal widths to use as a predictor. The petal lengths are already positive and we do not modify those. The effect is that we are regressing log petal length on log petal width and the sampling statement could have been replaced with
log(petal_length) ~ normal(alpha + beta * log_petal_width, sigma);The result would be the same up to a constant normalizing term involving petal length (which is constant) in the lognormal density.
Let’s run the model and summarize the results.
M = 100 if DRAFT else 1000
log_model = csp.CmdStanModel(stan_file =
'../stan/iris-petals-log.stan')
log_sample = log_model.sample(data = iris_data(), seed = 123,
iter_warmup = M, iter_sampling = M,
show_progress = False,
show_console = False)
log_sample.summary(sig_figs = 2)| Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | 56.00 | 0.02800 | 1.3000 | 53.00 | 56.00 | 57.00 | 2000.0 | 14000.0 | 1.0 |
| alpha | 1.30 | 0.00021 | 0.0140 | 1.30 | 1.30 | 1.30 | 4100.0 | 30000.0 | 1.0 |
| beta | 0.57 | 0.00022 | 0.0130 | 0.55 | 0.57 | 0.59 | 3800.0 | 27000.0 | 1.0 |
| sigma | 0.16 | 0.00017 | 0.0098 | 0.15 | 0.16 | 0.18 | 3300.0 | 23000.0 | 1.0 |
The estimated values ae \(\widehat{\alpha} = 1.3\) and \(\widehat{\beta} = 0.57\) and \(\sigma = 0.16\). One advantage of the log scale is that the regression makes more sense because \(\exp(\alpha + \beta \cdot \log w) = \exp(\alpha) \cdot w^{\beta}\). By taking exponents, the intercept \(\alpha\) becomes a multiplicative factor \(\exp(\alpha)\) and the slope \(\beta\) becomes an exponent. The error also moves to the multiplicative scale, which makes it relative. Multiplicative error makes sense here as we wouldn’t expect the same error scale for petals of width 0.5cm as for petals of width 5cm. The final relationship derived through this model is as follows. \[\begin{align} \textrm{length} &= \exp(\alpha + \beta * \log(\textrm{width}) \ \pm \ 2 \cdot \sigma) \\[6pt] &= 3.67 \cdot \textrm{width}^{0.56} \cdot 1.38^{\pm 1}. \end{align}\] The plus or minus 1 on the exponent turns into either multiplication or division by 1.38.
The posterior intervals are tighter for the lognormal model, which generally indicates a better fit to data. Let’s see what the data and twenty posterior draws of the regression look like on the log scale.
df = iris_data_frame()
df['log_petal_length'] = np.log(df['petal_length'])
df['log_petal_width'] = np.log(df['petal_width'])
alpha_draws = log_sample.stan_variable('alpha')
beta_draws = log_sample.stan_variable('beta')
plot = pn.ggplot(df, pn.aes(x='log_petal_width',
y='log_petal_length',
color='species'))
plot = plot + pn.geom_point()
plot = plot + pn.labs(x = "petal width (log cm)",
y = "petal length (log cm)")
for a, b in zip(alpha_draws[0:20], beta_draws[0:20]):
plot = plot + pn.geom_abline(intercept = a, slope = b,
alpha = 0.5, size = 0.2)
mydraw(plot)Now the iris versicolor and iris virginica look good, but the smaller iris setosa is still poorly characterized. We’ll fix this problem in the next section.
9.7 Multi-indexing: varying slopes and varying intercepts
We can see from the previous section that a single regression line doesn’t fit all three species of iris well. While the lognormal model is better than the linear model, it still fails to capture the characteristics of the smaller iris setosa.
On the log scale, it does look like it will be possible to capture iris setosa’s scale, as long we can build separate regressions for each species. The varying slopes and varying intercepts are sometimes called random effects, in contrast to the previous models’ fixed effects, which do not vary by data grouping.
Mathematically, we now have three intercepts three slopes, one pair for each species of iris. We will represent these as 3-vectors, \(\alpha, \beta \in \mathbb{R}^3\). Given our \(N\) data items, we will introduce an indexing data item \(\textrm{species} \in \{ 1, 2, 3 \}^N\), where \(\textrm{species}_n\) is the species of the \(n\)-th data item. Sticking to the log scale, our regression now looks like this, \[ \textrm{length}_n \sim \textrm{lognormal}(\alpha_{\textrm{species}_n} + \beta_{\textrm{species}_n} \cdot \log \textrm{width}_n, \sigma). \] The value \(\alpha_{\textrm{species}_n}\) is the intercept for \(\textrm{species}_n \in \{1, 2, 3\}\). We have kept the same error term \(\sigma\), though we could have also split that out on a per-species basis like the slope and intercept.
The Stan code follows the statistical notation.
../stan/iris-petals-varying.stan
data {
int<lower=0> N;
vector<lower=0>[N] petal_width;
vector<lower=0>[N] petal_length;
int<lower=0> K;
array[N] int<lower=1, upper=K> species;
}
transformed data {
vector[N] log_petal_width = log(petal_width);
}
parameters {
vector[K] alpha;
vector[K] beta;
real<lower=0> sigma;
}
model {
petal_length ~ lognormal(alpha[species] + beta[species] .* log_petal_width,
sigma);
alpha ~ normal(0, 3);
beta ~ normal(0, 3);
sigma ~ lognormal(0, 0.6);
}Here, we take \(K\) to be the number of species and the species indicator thus takes on values between \(1\) and \(K\) (inclusive). Where before we had alpha and beta for the intercept and slope, we now have alpha[species] and beta[species]. These will pick our vectors of size \(N\) (number of data items), as explained below. Further, the product of the slope and width is now an elementwise product (.*), which we also consider below. The sampling statement above is equivalent to
for (n in 1:N) {
petal_length[n]
~ lognormal(alpha[species[n]] + beta[species[n]] * log_petal_width[n],
sigma);
}Elementwise products in Stan
The elementwise product (.*) uses MATLAB syntax and is defined so that
(a .* b)[n] == a[n] * b[n] // trueElementwise division (./) works the same way and we don’t need elementwise addition or subtraction because those operations are already defined as regular vector addition and subtraction.
Multi-indexing in Stan
This model uses a technique we call multi-indexing and it works the similarly to the way indexing works in NumPy. Suppose we have a vector of size J and an array of indexes of size N.
array[N] int<lower=1, upper=J> idxs;
vector[J] foo;We can use the multiple indexes in idxs to index foo as foo[idxs]. The result is a size N vector (the type is taken from foo and the size from idxs), with values given by indexing into idxs. Indexing is defined as follows.
cols(foo[idxs]) == N // true
foo[idxs][n] == foo[idxs[n]] // trueThis tells us the number of columns of foo[idxs] is N and that indexing is done by first indexing into idxs and then using the result.
This works the same way for range indexing, such as foo[1:3], and for indexing with array literals like {7, 9, 3}, e.g.,
foo[3:5] == {foo[3], foo[4], foo[5]} // true
foo[{7, 9, 3}] == {foo[7], foo[9], foo[3]} // trueFitting our varying effects model
First, we add a species column to our data frame to represent the species of the iris as an integer 1, 2, or 3. Then we will just read in the data and fit and summarize the results.
M = 100 if DRAFT else 1000
vary_model = csp.CmdStanModel(
stan_file = '../stan/iris-petals-varying.stan')
vary_sample = vary_model.sample(data = iris_data(), seed = 123,
iter_warmup = M, iter_sampling = M,
show_progress = False,
show_console = False)
vary_sample.summary(sig_figs = 2)| Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | 130.000 | 0.04900 | 1.900 | 130.000 | 130.000 | 130.00 | 1500.00 | 1900.00 | 1.0 |
| alpha[1] | 0.490 | 0.00110 | 0.051 | 0.410 | 0.490 | 0.57 | 2335.00 | 2997.00 | 1.0 |
| alpha[2] | 1.300 | 0.00056 | 0.028 | 1.200 | 1.300 | 1.30 | 2551.00 | 3275.00 | 1.0 |
| alpha[3] | 1.500 | 0.00140 | 0.072 | 1.400 | 1.500 | 1.70 | 2572.00 | 3301.00 | 1.0 |
| beta[1] | 0.076 | 0.00069 | 0.033 | 0.024 | 0.076 | 0.13 | 2299.00 | 2951.00 | 1.0 |
| beta[2] | 0.590 | 0.00180 | 0.093 | 0.440 | 0.590 | 0.75 | 2549.00 | 3272.00 | 1.0 |
| beta[3] | 0.230 | 0.00200 | 0.100 | 0.064 | 0.230 | 0.40 | 2588.00 | 3322.00 | 1.0 |
| sigma | 0.100 | 0.00000 | 0.010 | 0.090 | 0.100 | 0.11 | 3305.42 | 4243.15 | 1.0 |
There are a few things worth noting in this summary. First, recall that the parameter estimates in the fixed effects model were \(\widehat{\alpha} = 1.3\), \(\widehat{\beta} = 0.57\), and \(\widehat{\sigma} = 0.16\). In the varying effects model, we see that the estimates for \(\alpha_k\) and \(\beta_k\) vary considerably by \(k\). For example, as we see in the data, there is very little effect of width on length for iris setosa, with \(\widehat{\beta}_1 = 0.075\), wheres there is a large effect for iris versicolor of \(\widehat{\beta}_2 = 0.6\). Furthermore, our value for \(\sigma\) is lower, which indicates lower residual error in our predictors. We will leave evaluating varying error scales to the reader.
We now have three different regression lines, which we plot on the log scale with color matching the data..
df = iris_data_frame()
df['log_petal_length'] = np.log(df['petal_length'])
df['log_petal_width'] = np.log(df['petal_width'])
plot = pn.ggplot(df, pn.aes(x='log_petal_width',
y='log_petal_length',
color='species'))
plot = plot + pn.geom_point()
plot = plot + pn.labs(x = "petal width (log cm)",
y = "petal length (log cm)")
plot = plot + pn.geom_abline(intercept = .49, slope = 0.076, color='red')
plot = plot + pn.geom_abline(intercept = 1.30, slope = 0.59, color='green')
plot = plot + pn.geom_abline(intercept = 1.50, slope = 0.23, color='blue')
mydraw(
plot
)We now have a good fit for each of the groups and, as we might expect, a lower error scale \(\sigma\) than for the model with a single regression.
10 Containers: arrays, vectors, and matrices
Stan is strongly typed, and each of the types has a distinct purpose. Constraints act as error checks in the data, transformed data, transformed parameter, and generated quantities blocks. In the parameter block, they are used to transform from unconstrained to constrained values (with an implicit change of variables adjustment). Stan Development Team (2023b) provides complete details or the full set of transforms, inverse transforms, and their log absolute Jacobian determinants.
10.1 Integer and real primitives
The two primitive types are int and real. In the compiled C++ code, these are 32-bit signed integers and 64-bit floating point values, respectively.
The third scalar type is complex, where a complex value consists of a real component and an imaginary component, both of which are represented as 64-bit floating point values.
Constrained primitive types
Integer and real types may be constrained with lower bounds, upper bounds, or both. For example, real<lower=0> is the type for a concentration, real<lower=0, upper=1> is the type for a probability, real<lower=-1, lower=1> is the type for a correlation, int<lower=0> is the type for a count, and int<lower=1, upper=5> is the type for an ordinal response to a survey.
There is a second kind of modifier, which is not technically a constraint, but is a transform. If we declare
vector<offset=mu, multiplier=sigma>[K] alpha;then the unconstrained representation is multiplied by sigma and then offset by mu to get the constrained value alpha. This is primarily used for non-centered parameterizations of hierarchical models, which are beyond this tutorial.
10.2 Arrays
For any type T, we can write array[N] T for the type of an array of size N containing elements of type T. We can also write array[M, N] T for a 2D array and so on for higher dimensionality. For example, array[3, 2] int is a two-dimensional array of integers of shape 3 by 2.
10.3 Matrices
The type matrix[M, N] is for an \(M \times N\) matrix. The type complex_matrix[M, N] is for an \(M \times N\) matrix with complex values.
Constrained matrix types
There are four special matrix types. The type cov_matrix is for positive definite matrices, whereas cholesky_factor_cov is for Cholesky factors of positive definite matrices. The type corr_matrix is for positive definite matrices with unit diagonals and cholesky_factor_corr is for Cholesky factors of correlation matrices. The Cholesky factors are lower triangular and should be preferred to the full matrices where possible.
10.4 Vectors and row vectors
The type vector[M] is for column vectors with M rows, whereas row_vector[N] is for row vectors with N columns. A column vector is like an \(M \times 1\) matrix and a row vector is like a \(1 \times N\) vector. There are also complex_vector and complex_row_vector types which work the same way.
The reason we do not just represent vectors as matrices is that we are able to reduce types in arithmetic. For example, the product of a row vector and a product vector is a scalar rather than a \(1 \times 1\) matrix. Similarly, the product of a matrix and a vector is a vector.
Constrained vector types
There is a vector type ordered[M] for vectors in ascending order, and a similar type pos_ordered[M] for vectors of non-negative values in ascending order.
The vector type unit_vector[M] is for vectors of unit Euclidean length (i.e., the sum of squared values is one). There is also a simplex[M] type for simplex (non-negative values that sum to one).
Assignment and function calls
Calling a function works like assigning the variables in the argument list to the variables in the function declaration. In general, Stan is what programming language researchers call covariant, meaning that if a type T is assignable to a type U, then a type array[N] T is assignable to array[N] U. Similarly, integer-valued primitives and containers may be assigned to their real or complex counterparts, and real-valued primitives and containers may be assigned to their complex counterparts (but not the other way around). Variables of a constrained type may be assigned to variables of an unconstrained type.
11 Stan Example: Discrete parameters and mixture models
Although Stan only allows parameters which are real or complex valued, it is able to work with models involving discrete parameters by marginalizing them out in the likelihood and working in expectation during inference.
There are two reasons Stan doesn’t allow discrete parameters. First, because Stan isn’t limited to directed graphical models, it’s technically impossible to pull out efficient conditional distributions for the discrete parameters in general (though it can be done in some limited cases). Second, discrete sampling tends not to perform very well because the parameters tend to get locked and not mix well.
11.1 Normal mixture model
We’ll consider a simple two-component normal mixture model parameterized with a discrete responsibility parameter. This model assumes that each item \(y_n\) is drawn from one of two possible normal distributions, \(\textrm{normal}(\mu_1, \sigma_1)\) or \(\textrm{normal}(\mu_2, \sigma_2)\), with a probability \(\lambda \in [0, 1]\) of being drawn from the first distribution. Let’s assume we have \(N\) observations \(y_n \in \mathbb{R}\). We will use \(z_n \in \{ 0, 1 \}\) as the discrete parameter representing the distribution from which \(y_n\) arose. This gives us the following model \[\begin{align} z_n &\sim \textrm{bernoulli}(\lambda) \\[6pt] y_n &\sim \textrm{normal}(\mu_{z_n}, \sigma_{z_n}) \end{align}\] In frequentist settings, the parameters \(z_n\) are sometimes called missing data to finesse a philosophical aversion to probability distributions over parameters.
11.2 Marginalization to the rescue
The immediate problem is that we cannot code this model in Stan by just following the math because we do not have discrete parameters. But what we can do is marginalize out the discrete parameters. We know from the law of total probability that if \(B\) is a discrete random variable, then we can derive the marginal probability \(p(a)\) from the joint probability function \(p(a, b)\) by \[ p(a) = \sum_{b \in B} p(a, b). \]
We can express our mixture model’s sampling distribution over both \(y\) and \(z\), \[ p(y, z \mid \lambda, \mu, \sigma) = \prod_{n = 1}^N p(y_n, z_n \mid \lambda, \mu, \sigma). \] This is called the complete data likelihood in frequentist settings.
Then we can evaluate the likelihood elementwise, \[ p(y_n, z_n \mid \lambda, \mu, \sigma) = \textrm{bernoulli}(z_n \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_{z_n}, \sigma_{z_n}). \] Now we can marginalize out the \(z_n\) by considering values 0 and 1, \[\begin{align} p(y_n \mid \lambda, \mu, \sigma) &= p(y_n, z_n = 1 \mid \lambda, \mu, \sigma) + p(y_n, z_n = 0 \mid \lambda, \mu, \sigma) \\[6pt] &= \textrm{bernoulli}(1 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1) \\[2pt] & \qquad + \ \textrm{bernoulli}(0 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_2, \sigma_2). \end{align}\] We can further substitute in the value for the Bernoulli densities, which are \[\begin{align} \textrm{bernoulli}(1 \mid \lambda) &= \lambda, \ \textrm{and} \\[4pt] \textrm{bernoulli}(0 \mid \lambda) &= 1 - \lambda, \end{align}\] to produce the standard form of the mixture model sampling distribution, \[ p(y_n \mid \lambda, \mu, \sigma) = \lambda \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \cdot \textrm{normal}(y_n \mid \mu_2, \sigma_2). \]
11.3 Simulated data: adult height of men and women
Let’s consider a specific instance, a collection of height data where measurements for men and women have not been identified. Let’s assume for the sake of simulation that men have a mean height of 175cm with standard deviation of 7.5cm and women have a mean height of 162cm with a standard deviation of 6.3cm. We’ll further assume that women make up 51% of the adult population (this varies across age bands). This will allow us to simulate and plot data for 5000 randomly selected heights; we put vertical lines at the mean female height (red) and male height (blue).
M = 100 if DRAFT else 5_000
female_mean_height_cm = 162
male_mean_height_cm = 175
female_sd_height_cm = 6.3
male_sd_height_cm = 7.5
prob_female = 0.51
mu = np.array([female_mean_height_cm, male_mean_height_cm])
sigma = np.array([female_sd_height_cm, male_sd_height_cm])
sex = np.array(np.random.binomial(n = 1, p = prob_female, size=M))
heights = np.random.normal(loc = mu[sex], scale = sigma[sex], size=M)
mydraw(
pn.ggplot(pd.DataFrame({'height': heights}),
pn.aes(x = 'height')) +
pn.geom_histogram(bins=50, color='white') +
pn.labs(x = 'height') +
pn.geom_vline(xintercept=162, color='red') +
pn.geom_vline(xintercept=175, color='blue') +
pn.theme(axis_text_y = pn.element_blank(),
axis_title_y = pn.element_blank(),
axis_ticks_major_y = pn.element_blank())
)11.4 The log scale and log sum of exponents operation
We have successfully marginalized out the \(z\) and derived \(p(y \mid \lambda, \mu, \sigma)\), but this leaves us with a residual problem—Stan works on the log scale and we have done our derivation on the linear scale. That is, we start with \[\begin{align} \log p(y_n \mid \lambda, \mu, \sigma) = \log\! \big( \, \strut & \textrm{bernoulli}(0 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_0, \sigma_0) \\[2pt] & + \textrm{bernoulli}(1 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1)\, \big). \end{align}\] To reduce this further, we are going to use the log-sum-exp operation, which is defined by \[ \textrm{log-sum-exp}(a, b) = \log \left( \strut \exp(a) + \exp(b)\right). \]
The reason we use log-sum-exp is that it can be made arithmetically stable in such a way as to prevent numerical overflow by pulling out the maximum value, \[ \textrm{log-sum-exp}(a, b) = \textrm{max}(a, b) + \textrm{log-sum-exp}(a - \textrm{max}(a, b), \ b - \textrm{max}(a, b)) \\[4pt] \] This form eliminates numerical overflow because exponentiation is only applied to the terms \(\left( a - \textrm{max}(a, b)\right)\) and \(\left(b - \textrm{max}(a, b)\right)\), both of which must be less than or equal to zero by construction. Underflow is mitigated by pulling the leading term \(\textrm{max}(a, b)\) out of the sum.
Using log-sum-exp, we can now rewrite our marginalized individual item likelihood as \[\begin{align} \log p(y_n \mid \lambda, \mu, \sigma) = \textrm{log-sum-exp}\big( &\log \textrm{bernoulli}(0 \mid \lambda) + \log \textrm{normal}(y_n \mid \mu_0, \sigma_0), \\[2pt] &\log \textrm{bernoulli}(1 \mid \lambda) + \log \textrm{normal}(y_n \mid \mu_1, \sigma_1) \big). \end{align}\]
11.5 Stan program for mixtures
As in our earlier examples, the Stan code for mixtures just follows the mathematical definitions.
heights.stan
data {
int<lower=0> M;
vector<lower=0>[M] heights;
}
parameters {
real<lower=0, upper=1> lambda;
ordered[2] mu;
vector<lower=0>[2] sigma;
}
model {
mu ~ normal(170, 10);
sigma ~ lognormal(log(7), 1);
for (m in 1:M) {
real lp1 = bernoulli_lpmf(0 | lambda)
+ normal_lpdf(heights[m] | mu[1], sigma[1]);
real lp2 = bernoulli_lpmf(1 | lambda)
+ normal_lpdf(heights[m] | mu[2], sigma[2]);
target += log_sum_exp(lp1, lp2);
}
}This program uses local variables lp1 and lp2. These have scope just within the for-loop and take on different values with each loop iteration. In general, local variables can be assigned and reassigned in Stan. They are declared with sizes, but cannot have constraints.
Unlike in the original model, we have to index mu and sigma from 1 because Stan arrays are indexed from 1. The biggest departure from the mathematical model as written is that we have declared mu to be of type ordered, which is the type of a vector of increasing values. The reason we do this is that the indexes are not well identified in the sense that swapping indexes preserves the same likelihood. This way, the smaller component will be the first component and the model will be identified without changing the likelihood. Betancourt (2017b) provides more details on how this works and shows that ordering does not change the underlying model.
We have included weakly informative priors for mu and sigma based on our prior knowledge of heights. The likelihood then just follows the mathematical definition. We had to resort to a loop because Stan isn’t currently able to vectorize mixtures.
11.6 Fitting the simulated data
Now we can fit our simulated data.
model = csp.CmdStanModel(stan_file = '../stan/heights.stan')
data = {'M': M, 'heights': heights}
J = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
iter_warmup = J, iter_sampling = J,
show_progress = False, show_console = False)
sample.summary(sig_figs = 2) | Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -18000.00 | 0.0480 | 1.60 | -18000.00 | -18000.00 | -18000.00 | 1100.0 | 12.0 | 1.0 |
| lambda | 0.52 | 0.0028 | 0.07 | 0.41 | 0.52 | 0.64 | 620.0 | 6.3 | 1.0 |
| mu[1] | 162.00 | 0.0260 | 0.68 | 161.00 | 162.00 | 163.00 | 689.0 | 7.0 | 1.0 |
| mu[2] | 175.00 | 0.0470 | 1.10 | 173.00 | 175.00 | 177.00 | 592.0 | 6.0 | 1.0 |
| sigma[1] | 6.10 | 0.0100 | 0.28 | 5.70 | 6.10 | 6.60 | 733.0 | 7.5 | 1.0 |
| sigma[2] | 7.70 | 0.0180 | 0.45 | 7.00 | 7.70 | 8.50 | 623.0 | 6.3 | 1.0 |
Mixture models are quite challenging to fit when the components have similar distributions as they do here (the normal components have similar locations and scales so that much of their probability mass overlaps). Checking the fit, we find \(\widehat{R}\) values as high as 1.01 (still acceptable) and effective sample sizes of around 1/6 the number of iterations. Nevertheless, we see the Monte Carlo standard error is relatively low for all posterior means other than for \(\lambda\). Although we recover reasonable estimates for all parameters (simulated values were \(\lambda = 0.51\), \(\mu = [162 \
175]\) and \(\sigma = [6.3 \ 7.5]\)), the posterior intervals are only well constrained for mu, with sigma, with lambda being less well identified. These intervals shrink with the number observations \(N\) at a rate of \(\mathcal{O}(1 / \sqrt{N})\).
11.7 Built-in Stan function for mixtures
For simple mixtures of this kind, Stan supplies a built-in function
log_mix(lambda, lp1, lp2)
= log_sum_exp(log(lambda) + lp1,
log1m(lambda) + lp2)which is a more efficient drop-in replacement for the explicit definition in our example code. The body of the loop in the model can be replaced with the one-liner
target += log_mix(lambda, normal_lpdf(heights[m] | mu[1], sigma[1]),
normal_lpdf(heights[m] | mu[2], sigma[2]));This may run a bit faster, but it won’t mix any better as it defines exactly the same target log density as the long form.
11.8 Recovering posterior distributions over discrete parameters
We have marginalized out the discrete parameters, but what if they are of interest in the fitted model? It turns out, we can sample \(z\) in the generated quantities block. All we have to do is work out the probability that \(z_n = 0\) (female) and sample. Once we have this probability, we can calculate many quantities of interest in expectation rather than working with the sample (i.e., we will Rao-Blackwellize). The calculation for the expected sex given values of \(\mu\) and \(\sigma\) and \(y_n\) is \[ \Pr\!\left[Z_n = 1 \mid y, \mu, \sigma\right] \ = \ \mathbb{E}\left[Z_n \mid y, \mu, \sigma\right] \ = \ \frac{\displaystyle p(y_n \mid \mu_0, \sigma_0)} {\displaystyle p(y_n \mid \mu_0, \sigma_0) + p(y_n \mid \mu_1, \sigma_{z_n})} \] On the log scale where Stan operates, that’s \[ \log \textrm{Pr}[Z_n = 1 \mid y, \mu, \sigma] = \log p(y_n \mid \mu_0, \sigma_0) - \textrm{log-sum-exp}(\log p(y_n \mid \mu_0, \sigma_0), \log p(y_n \mid \mu_1, \sigma_1). \]
Here’s a new generated quantities block we can add to the last program, heights.stan, in order to generate the probability that the first 10 data items are heights for women, and to randomly generate the sex for the first ten individuals based on this probability.
../stan/heights-post.stan
generated quantities {
vector<lower=0, upper=1>[10] Pr_female;
for (m in 1:10) {
real lp1 = bernoulli_lpmf(0 | lambda)
+ normal_lpdf(heights[m] | mu[1], sigma[1]);
real lp2 = bernoulli_lpmf(1 | lambda)
+ normal_lpdf(heights[m] | mu[2], sigma[2]);
Pr_female[m] = exp(lp1 - log_sum_exp(lp1, lp2));
}
array[10] int<lower=0, upper=1> sex = bernoulli_rng(Pr_female);
}Now we can compile and fit this model.
model = csp.CmdStanModel(stan_file = '../stan/heights-post.stan')
data = {'M': M, 'heights': heights}
J = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
iter_warmup = J, iter_sampling = J,
show_progress = False, show_console = False)
sample.summary(sig_figs = 2) | Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -18000.0000 | 0.04700 | 1.6000 | -18000.00000 | -18000.0000 | -18000.000 | 1200.0 | 13.0 | 1.0 |
| lambda | 0.5200 | 0.00270 | 0.0670 | 0.41000 | 0.5200 | 0.640 | 640.0 | 6.7 | 1.0 |
| mu[1] | 162.0000 | 0.02500 | 0.6600 | 161.00000 | 162.0000 | 163.000 | 683.0 | 7.1 | 1.0 |
| mu[2] | 175.0000 | 0.04400 | 1.1000 | 173.00000 | 175.0000 | 177.000 | 650.0 | 6.8 | 1.0 |
| sigma[1] | 6.1000 | 0.01000 | 0.2700 | 5.70000 | 6.1000 | 6.600 | 757.0 | 7.9 | 1.0 |
| sigma[2] | 7.7000 | 0.01700 | 0.4400 | 7.00000 | 7.7000 | 8.500 | 695.0 | 7.3 | 1.0 |
| Pr_female[1] | 0.9500 | 0.00180 | 0.0470 | 0.85000 | 0.9600 | 0.990 | 676.0 | 7.1 | 1.0 |
| Pr_female[2] | 0.0670 | 0.00160 | 0.0400 | 0.01900 | 0.0600 | 0.140 | 652.0 | 6.8 | 1.0 |
| Pr_female[3] | 0.0590 | 0.00140 | 0.0360 | 0.01600 | 0.0520 | 0.130 | 653.0 | 6.8 | 1.0 |
| Pr_female[4] | 0.4200 | 0.00460 | 0.1200 | 0.24000 | 0.4100 | 0.620 | 648.0 | 6.8 | 1.0 |
| Pr_female[5] | 0.0490 | 0.00120 | 0.0310 | 0.01300 | 0.0430 | 0.110 | 653.0 | 6.8 | 1.0 |
| Pr_female[6] | 0.1600 | 0.00290 | 0.0730 | 0.06200 | 0.1500 | 0.300 | 648.0 | 6.8 | 1.0 |
| Pr_female[7] | 0.4700 | 0.00470 | 0.1200 | 0.28000 | 0.4600 | 0.670 | 648.0 | 6.8 | 1.0 |
| Pr_female[8] | 0.9000 | 0.00240 | 0.0610 | 0.78000 | 0.9100 | 0.970 | 651.0 | 6.8 | 1.0 |
| Pr_female[9] | 0.0064 | 0.00024 | 0.0062 | 0.00076 | 0.0045 | 0.018 | 668.0 | 7.0 | 1.0 |
| Pr_female[10] | 0.0580 | 0.00140 | 0.0360 | 0.01600 | 0.0510 | 0.130 | 653.0 | 6.8 | 1.0 |
| sex[1] | 0.9400 | 0.00430 | 0.2300 | 0.00000 | 1.0000 | 1.000 | 2994.0 | 31.0 | 1.0 |
| sex[2] | 0.0730 | 0.00420 | 0.2600 | 0.00000 | 0.0000 | 1.000 | 3766.0 | 39.0 | 1.0 |
| sex[3] | 0.0610 | 0.00410 | 0.2400 | 0.00000 | 0.0000 | 1.000 | 3464.0 | 36.0 | 1.0 |
| sex[4] | 0.4100 | 0.00830 | 0.4900 | 0.00000 | 0.0000 | 1.000 | 3530.0 | 37.0 | 1.0 |
| sex[5] | 0.0440 | 0.00330 | 0.2000 | 0.00000 | 0.0000 | 0.000 | 3853.0 | 40.0 | 1.0 |
| sex[6] | 0.1600 | 0.00660 | 0.3700 | 0.00000 | 0.0000 | 1.000 | 3072.0 | 32.0 | 1.0 |
| sex[7] | 0.4800 | 0.00850 | 0.5000 | 0.00000 | 0.0000 | 1.000 | 3466.0 | 36.0 | 1.0 |
| sex[8] | 0.9000 | 0.00540 | 0.3000 | 0.00000 | 1.0000 | 1.000 | 3111.0 | 33.0 | 1.0 |
| sex[9] | 0.0065 | 0.00130 | 0.0800 | 0.00000 | 0.0000 | 0.000 | 3872.0 | 41.0 | 1.0 |
| sex[10] | 0.0510 | 0.00340 | 0.2200 | 0.00000 | 0.0000 | 1.000 | 4105.0 | 43.0 | 1.0 |
The probabilities range from 0 to 1 with some intermediate values. We have calculated posterior expectations two ways, by averaging expected values (Pr_female) and by taking random draws (sex). The posterior means are similar for these variables, but sex has much higher variance due to the sampling.
We can compare the probabilities to the simulated values of \(z\), which are
print(f"{sex[0:10] =}")sex[0:10] =array([0, 1, 1, 0, 0, 1, 0, 0, 1, 1])
12 Debugging Stan programs
As is usual in simple tutorials like this, we have presented a series of Stan programs that actually work as intended. When developing new programs, one often runs into problems. This section will cover some typical Stan coding errors and provide some tips on debugging.
12.1 Failure to provide support over constrained parameters
Stan requires programs to assign a finite log density to any parameters that satisfy the constraints. For example, the following program puts no constraints on sigma in the declaration, but the model block requires sigma to be greater than zero.
parameters {
real sigma;
}
model {
sigma ~ lognormal(0, 1);
} This violates Stan’s requirements because the declared constraints allow a negative value for sigma, but the model block does not. The way to fix this is to declare sigma to be of type real<lower=0>. Often this kind of bug shows up as a failure to randomly initialize. For instance, if we had vector[10] sigma, we’d only have a \(2^{-10}\) or about 1/1000 chance of generating valid initializations at random, because Stan initializes parameters using a \(\textrm{uniform}(-2, 2)\) distribution on the unconstrained scale.
12.2 Start with small, simulated data sets
It can be difficult to debug wild type data at scale. Instead, simulating a small data set lets you work the kinks out of the code in a controlled context where we know the true answers. Then, when the code is working over small simulated data, it can be loosed on the real data.
Simulating data can be challenging for some models. First, it can be a lot of effort, because it requires rewriting the model using random number generation in the generated quantities block. Second, some models are not fully generative because they either have improper priors or because they have reduced rank parameterizations (e.g., intrinsic conditional autoregressive models or sum-to-zero parameter constraints).
12.3 Debug by print
Stan has a built-in print() function that can take string and Stan variable arguments, which will be printed when it is executed. In particular, printing the target() value can help detect when it becomes \(-\infty\) (and thus causes rejections). This can be done line by line, for example,
model {
y ~ normal(alpha + beta * x, sigma);
print("step 17: target() = ", target());
sigma ~ lognormal(0, 1);
print("step 18: target() = ", target());
}Then if one of the statements throws an exception, you will see the print statement just before it was called. If one of the statements leads to a not-a-number or negative infinity value, you will see the first place this affects target(). You can also print the results of intermediate calculations. Be careful printing long structures; you might want to try printing a slice, e.g.,
matrix[256, 256] a = b * c;
print("a[1:3, 5:6] = ", a[1, 1]);12.4 Test quantities of interest
So far, we’ve only talked about making the program work as intended. We also want to test our output as we have done in several of our running examples. If we wrote a model in order to provide posterior predictive inference, test it with posterior predictive inference (either real or simulated).
12.5 More advice on workflow
Going into full details of how to develop Stan models is beyond this simple getting started tutorial. For more information, see Gelman et al. (2020). The various inference mechanisms targeted by the Bayesian workflow suggestions can be implemented in Stan as outlined in Part III of Stan Development Team (2023c).
13 Making Stan programs go faster
For MCMC, there are two independent components to efficiency, (1) the statistical efficiency of the sampler, which is measured by effective sample size per iteration, and (2) program efficiency, which is measured by how long the program takes to compute the log density and gradients.
13.1 Computational efficiency
There are many ways to write a Stan program to compute the log density and gradients of a model with respect to a fixed parameter vector. Some of them are faster than others. Which version is faster may wind up depending on the shape of the data (e.g., caching can be very valuable in some circumstances where variables are heavily reused, but can be a burden in sparse data circumstances where all combinations of variables are not used).
For speeding up Stan programs, most of the usual considerations in speeding up computer programs apply. Specifically, we want to optimize memory locality and we want to reduce branch points; together these are the main efficiency bottlenecks for modern code.
Working first, then optimize
As Donald Knuth is reputed to have said, “Premature optimization is the root of all evil.” It’s pretty much always a good idea to get a simple version of your code working on a small slice of data first. Then only worry about optimization if you need it. If you do need to optimize, don’t try it without an unoptimized form that works as a guide. Every developer has learned through hard experience that it’s often a whole lot easier to optimize working code than it is to debug optimized code.
Do not repeat yourself
The main rule for making Stan programs go fast is don’t repeat yourself. If there is a computation, don’t do that computation again, save the result in a local variable and reuse. For example, don’t do this:
for (n in 1:N) {
y[n] ~ normal(alpha + beta * x[n], exp(log_sigma));
}The problem here is that exp(log_sigma) gets computed N times. This uses more memory, as Stan records every operation with links to its operands during. It is also slow, with N - 1 redundant applications of exponentiation in the forward pass and and multiply/adds for the chain rule in the backward pass. Instead, the code should be refactored to compute the exponentiation of log_sigma once, assign it to a local variable, and re-use it in the loop.
real sigma = exp(log_sigma);
for (n in 1:N) {
y[n] ~ normal(alpha + beta * x[n], sigma);
}Avoid automatic differentiation if possible
If you have a constant that’s being defined, define it in the transformed data block. That way it only gets evaluated once. For example, don’t do the following, as it repeatedly allocates and fills a \(K\)-vector and then applies useless automatic differentiation steps in the reverse pass.
data {
real<lower=0> alpha;
}
model {
vector[K] alpha_v = rep_vector(alpha, K);
theta ~ dirichlet(alpha_v);
} Instead, define the constant in the transformed data block and then reuse it in the model block, as follows.
transformed data {
vector<lower=0>[K] alphas = rep_vector(alpha, K);
}
model {
theta ~ dirichlet(alphas);
}The other way to get rid of autodiff is to move posterior predictions into the generated quantities block. In general, if it’s possible to move a variable to the generated quantities block, it should be moved. For example, we might be tempted do posterior predictive simulation as follows.
parameters {
real alpha, beta;
real<lower=0> sigma;
vector[M_tilde] y_tilde;
}
model {
y ~ normal(alpha + beta * x, sigma);
y_tilde ~ normal(alpha + beta * x_tilde, sigma);
}This is correct and will provide the right answer, but it is computationally and statistically inefficient. It’s computationally inefficient because Hamiltonian Monte Carlo is more expensive than random variate generation. It’s also statistically inefficient because it will typically produce correlated draws rather than independent draws, leading to a lower effective sample size. In situations like this, we can move y_tilde down to generated quantities because it’s conditionally independent of the data given mu and sigma.
parameters {
real alpha, beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
array[M_tilde] y_tilde = normal_rng(alpha + beta * x_tilde, sigma);
} Vectorized _rng functions return arrays; if we really need a vector for y_tilde for some reason, we can convert with to_vector().
Do not use diagonal multivariate normal
Instead of
y ~ multi_normal(mu, diag_matrix(sigma));which creates a matrix then factors to compute its inverse and determinant, we can use the vectorized form of the univariate normal,
y ~ normal(mu, sigma);The result is the same up to a constant.
Memory locality, copying, and storage order
Accessing a row or column of a matrix requires allocating memory and copying. Memory pressure is one of the biggest constraints in modern code and relieving it is one of the best things to do to make code go faster.
Matrices are organized in column-major order in memory, whereas arrays are row major. That means that the columns of a matrix are stored together in memory, and therefore slicing out a column Sigma[1:M, n] is relatively efficient for large M, but slicing out a row using Sigma[m, 1:N] will be inefficient if M is large.
If you need to access vectors one by one but you do not need matrix operations, use an array of vectors in either order:
matrix[M, N] alpha;
array[M] row_vector[N] alpha_arr;
array[N] vector[M] alpha_t_arr; // transposedAccessing elements of an array does not require any copying.
Overall, copying and rearranging data structures is not very expensive in Stan. Loops are very fast, and rearranging doesn’t add anything to the expression graph for automatic differentiation.
Vectorize non-linear operations over parameters
Even though loops are very fast in Stan, every expression involving parameters leads to nodes and edges being added to the expression graph. The usual suspect is sampling in a loop, as follows.
for (n in 1:N) {
y[n] ~ normal(alpha + beta * x[n], sigma);
}Written this way, Stan isn’t smart enough to realize it can compute log(sigma) once and re-use. It will also build a huge number of expression graph edges from the log density of to alpha and beta, one for each y[n]. The same contribution to the log density can be executed much faster with vectorized code such as the following.
y ~ normal(alpha + beta * x, sigma);The speed from vectorization is not because loops are slow, it’s because doing autodiff in a loop leads to a much larger expression graph and hence much more work when updating the chain rule. The vectorized form has a single output variable for the log density with three edges leading to alpha, beta, and sigma. It will also be clever enough to calculate log(sigma) once and reuse it. This advantage is even bigger for distributions like multivariate normal that require expensive internal determinant calculations.
13.2 Statistical efficiency
By far the best solution for making Stan programs go faster is to improve the statistical efficiency of the model.
Identifiability
The first thing to consider is identifiability. A likelihood \(p(y \mid \theta)\) is identifiable if distinct parameters lead to distinct likelihoods, or in symbols, if \(\theta \neq \theta'\) implies \(p(y \mid \theta) \neq p(y \mid \theta')\). The other way around, a likelihood is not identifiable if there exist \(\theta \neq \theta'\) such that \(p(y \mid \theta) = p(y \mid \theta').\)
The classic Bradley-Terry model for inferring quality based on pairwise comparisons has a non-identifiable likelihood. Let’s consider an application to sports team ranking (it was originally applied to consumer products). Each team \(j\) has an ability ranking \(\alpha_j\). The log odds of team \(j\) defeating team \(k\) is modeled as \(\alpha_j - \alpha_k\). The observations are of the results of games \(n\) between team \(A_n\) and \(B_n\) (the notation is meant to be suggestive of the original application to A/B testing). This defines the likelihood \[ p(y_n \mid \alpha) = \textrm{bernoulli}(\textrm{logit}^{-1}(\alpha_{A_n} - \alpha_{B_n}). \] In Stan, this likelihood can be coded as
model {
y ~ bernoulli_logit(alpha[A] - alpha[B]);
}where the vectorized sampling statement unfolds to the following equivalent, but less efficient, form.
for (n in 1:N) {
y[n] ~ bernoulli_logit(alpha[A[n]] - alpha[B[n]]);
} The model is clearly non-identifiable because \[ p(y_n \mid \alpha) = p(y_n \mid \alpha + c). \]
To get around this problem, we can do three things. The first two involve reducing the number of free parameters and the third involves soft identification through a prior. First, we can pin one of the values \(\alpha\), traditionally by setting \(\alpha_1 = 0\). This reduces the number of free parameters by one, as is reflected in the Stan parameterization.
parameters {
vector[J - 1] alpha_rest;
}
transformed parameters {
vector[J] alpha = append_row(0, alpha_rest);
}We can then work with alpha directly. In the second approach, we can enforce the constraint \(\textrm{sum}(\alpha) = 0\). This also reduces the number of free parameters by one and is coded in a similar way, only with a different transform.
transformed parameters {
vector[J] alpha = append_row(-sum(alpha_rest), alpha_rest);
} On the plus side, this approach is symmetric. On the downside, it is not generative in that it is not clear how to add a new player \(J + 1\).
The third way to make a Bayesian model identifiable is to add a prior. For example, we can do this,
model {
alpha ~ normal(0, 3);
}This provides a kind of soft identification in the posterior, where values of alpha near 0 have higher posterior density. The third approach is overparameterized but has a pleasing symmetry and generalizability to new players.
Adaptation and preconditioning
The condition number of a positive-definite matrix is the ratio of its largest eigenvalue to its smallest. For a density, we can consider the negative inverse Hessian of its log density at a point. For a multivariate normal distribution, the negative inverse Hessian is its covariance at every point. That is, a normal distribution has constant curvature.
The larger the condition number, the harder it is to sample from a model, because multiple steps at the scale of the smallest eigenvalue are required to take a step in the direction of the largest eigenvalue. The first-order gradient-based approximation that Hamiltonian Monte Carlo uses can also fail due to high curvature; when the step is too large, the result will be divergent Hamiltonian trajectories (i.e., ones that do not preserve the Hamiltonian).
When the inverse Hessian is not positive definite, the density is not convex, which can present an even greater challenge to sampling than poor conditioning. If the inverse Hessian is positive definite and relatively constant over the posterior, we can correct for non-unit conditioning by preconditioning, which can rotate and scale a density until it looks more like a standard normal. In fact, given a multivariate normal it does just that—rotates and scales back to a standard normal.
Stan begins adaptation in a state that assumes the posterior is standard normal, which is (a) centered at the origin, (b) independent in dimension, and (c) unit scale. The closer a posterior is to standard normal, the easier it will be for Stan to adapt and to sample.
During adaptation, Stan estimates the posterior (co)variance to use for preconditioning. If the posterior Hessian is constant, the covariance will be equal to the negative inverse Hessian everywhere.
With its default settings, Stan preconditions by scaling each of the parameters as close to unit scale as it can manage. Under this scheme, in \(D\) dimensions, the adapted metric requires \(\mathcal{O}(D)\) storage and each leapfrog step (the most inner loop in Stan) requires \(\mathcal{O}(D)\) time. By selecting a dense metric, Stan will also adapt to a fixed curvature (i.e., one that does not vary over the posterior). This requires \(\mathcal{O}(D^2)\) storage, several \(\mathcal{O}(D^3)\) costs to Cholesky factor, and then requires \(\mathcal{O}(D^2)\) time per leapfrog step to evaluate.
Reparameterization
We can often reparameterize models to make them easier to sample. The standard example is reparameterizing a varying effect to use a non-centered parameterization. The centered parameterization is
parameters {
real<lower=0> sigma;
vector[N] alpha;
}
model {
sigma ~ lognormal(0, 1.5);
alpha ~ normal(0, sigma);
}This model induces a strong dependence between sigma and alpha. When sigma is small, alpha must be small, and when sigma is large, alpha will vary over a wide range. This varying curvature (high curvature for low sigma, very flat expanses for high sigma) is very challenging for MCMC sampling.
We can overcome the difficulty in this parameterization by non-centering, which we can code explicitly as follows.
parameters {
real<lower=0> sigma;
vector[N] alpha_std;
}
transformed parameters {
alpha = sigma * alpha_std;
}
model {
sigma ~ lognormal(0, 1.5);
alpha_std ~ normal(0, 1);
}We now have an independent standard normal distribution on alpha_std, not on alpha. It induces the correct distribution on the transformed parameter alpha, so that alpha has a normal(0, sigma) distribution. While this correction is only in the prior, if there is not a lot of data for each n, the posterior will be similar.
We can code the same model as above using an affine variable transform to rescale alpha with sigma.
parameters {
real<lower=0> sigma;
vector<multiplier=sigma>[N] alpha; // non-centering
}
model {
sigma ~ lognormal(0, 1.5)
alpha ~ normal(0, sigma);
}Priors for speed and knowledge
Often we are tempted to use vague or overdispersed priors. When this gets too extreme, it can cause computational difficulties navigating flat expanses of probability space. And we will have statistical difficulties from the prior pulling most of the probability mass away from the natural posterior.
If you have used vague priors to avoid thinking too hard and computation is going awry, you might consider adding priors with a bit more information. It’s often possible to impose weakly informative priors that only determine the scale of a variable, which is often known. For example if we think a value \(\alpha\) is going to have a value in the single digits, we can use a prior like
alpha ~ normal(0, 5);that puts approximately 95% of the probability mass in the region \((-10, 10)\).
We do not recommend interval constraints other than for naturally constrained parameters (e.g., probabilities fall in \([0, 1]\) and concentrations in \([0, \infty)\)). The reason we do not want to use
alpha ~ uniform(-10, 10);is that if the data is consistent with alpha near the boundary (-10 or 10), then probability mass gets bunched up and posterior means get pushed away from the boundary. For example, consider the following Stan program.
stan/interval-censor.stan
data {
int<lower=0> B;
int<lower=0> N;
vector[N] y;
}
parameters {
real<lower=-B, upper=B> mu;
real<lower=0> sigma;
}
model {
y ~ normal(mu, sigma);
} In this model, we have an improper uniform prior on sigma and a proper uniform prior on mu in the interval \((-B, B)\). Stan allows improper priors as long as the posterior is proper.
We can then simulate data according to the sampling distribution, \[ y_n \sim \textrm{normal}(9.9, 4). \] Although the true value of \(\mu\) is 9.9, which falls in the interval \((-10, 10)\), consider what happens when we fit 50 random draws with this model.
model = csp.CmdStanModel(stan_file = '../stan/interval-censor.stan')
N = 50
mu = 9.9
sigma = 4
y = np.random.normal(mu, sigma, N)
B = 10
data = {'B': B, 'N': N, 'y': y}
M = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
iter_warmup = M, iter_sampling = M,
show_progress = False, show_console = False)
print(f"interval = ({-B}, {B})")
sample.summary(sig_figs = 2) interval = (-10, 10)
| Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -92.0 | 0.0320 | 1.10 | -94.0 | -91.0 | -90.0 | 1200.0 | 19000.0 | 1.0 |
| mu | 9.7 | 0.0057 | 0.28 | 9.1 | 9.7 | 10.0 | 2300.0 | 38000.0 | 1.0 |
| sigma | 3.8 | 0.0100 | 0.41 | 3.2 | 3.8 | 4.5 | 1500.0 | 25000.0 | 1.0 |
What happens is that all of the posterior mass for \(\mu\) that would fall above 10 is pushed down inside the interval. As a result, we bias \(\mu\) and \(\sigma\) so that their estimates are too low. If we were to expand the interval to 20 and fit the same data, the result is quite different.
B = 20
data = {'B': B, 'N': N, 'y': y}
M = 100 if DRAFT else 1000
sample2 = model.sample(data = data, seed = 123,
iter_warmup = M, iter_sampling = M,
show_progress = False, show_console = False)
print(f"interval = ({-B}, {B})")
sample2.summary(sig_figs = 2) interval = (-20, 20)
| Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -88.0 | 0.0250 | 1.10 | -90.0 | -87.0 | -87.0 | 1700.0 | 32000.0 | 1.0 |
| mu | 10.0 | 0.0110 | 0.54 | 9.4 | 10.0 | 11.0 | 2500.0 | 45000.0 | 1.0 |
| sigma | 3.8 | 0.0078 | 0.42 | 3.2 | 3.8 | 4.6 | 2900.0 | 53000.0 | 1.0 |
If you were thinking we should recover \(\mu = 9.9\) and \(\sigma = 4\), it’s worth noting the sample mean and standard deviation for our simulated data y, which are
print(f"mean(y) = {np.mean(y):5.2f}; sd(y) = {np.std(y):4.2f}")mean(y) = 10.28; sd(y) = 3.67
13.3 A final note on priors
Priors are particularly important when there is not much data in order to control computation. We can evaluate prior sensitivity by considering changing priors, as we did in the last section, to see what their effect is on the posterior.
In general, we would recommend using as much prior knowledge as you can during inference. Not doing so leaves useful information on the table.
14 Further reading
14.1 Scientific programming in Python
A good introduction to Python for data science applications is VanderPlas (2023). The current edition is available open access from the GitHub repository for the Python Data Science Handbook.
For intermediate Python programming, a standard reference is Ramalho (2022). The current edition is available from the GitHub repository for Fluent Python.
14.2 Random number generation
The standard reference for pseudorandom number generation is still the book by Devroye (1986). It starts from scratch with an overview of uniform pseudorandom number generation and proceeds through advanced multivariate techniques. The author has provided a free online pdf on the home page of Non-Uniform Random Variate Generation.
14.3 Bayesian statistics
For an introduction to Bayesian statistics, I would highly recommend starting with the book by McElreath (2023). The second edition is available as a free online pdf from the GitHub repository for Statistical Rethinking.
For more advanced Bayesian statistics and more on computation, the third edition of the classic by Gelman et al. (2013). The third edition is available as a free pdf from the home page for Bayesian Data Analysis 3.
14.4 Markov chain Monte Carlo
A good single-source for learning about MCMC methods is the handbook edited by Brooks et al. (2011). The key chapters on basic MCMC theory, convergence monitoring, and Hamiltonian Monte Carlo are available for free from the sample chapters page of the Handbook of Markov Chain Monte Carlo.
We can also recommend the paper by Roberts and Rosenthal (2004) as a solid mathematical foundation for MCMC that is about the same mathematical level as the appendices of this introduction.
14.5 Stan
The reference Stan paper is by our original development team (Carpenter et al. 2015), but there is no point in reading that paper if you have read this. It’s better to skip directly to the Stan User’s Guide (Stan Development Team 2023c), which contains much more information on modeling in Stan than this introduction. For full details on Stan’s programming language and execution, see the Stan Reference Manual (Stan Development Team 2023b); for information on the library of special mathematical and statistical functions, see the Stan Functions Reference (Stan Development Team 2023a).
There is also a wealth of information available in the form of Stan case studies and Stan conference proceedings, both of which are distributed in the form of reproducible notebooks.
14.6 Bayesian workflow
The paper that kicked off the Stan project’s focus on workflow is Gabry et al. (2019); the current state of our thinking is Gelman et al. (2020). Many of the Stan developers are working on turning the paper into a book; the current state is available from the public GitHub repository for Bayesian Workflow.
A. Probability theory
This brief overview of the main definitions of probability theory was greatly influenced by Anderson and Moore (1979), Appendix A, “Brief Review of Results of Probability Theory.”
A.1 Probability spaces
Sample spaces and \(\sigma\)-algebras
A sample space is a non-empty set \(\Omega\), the members of which represent possible ways the world might be (i.e., what are sometimes known as “possible worlds”).
A \(\sigma\)-algebra \(\mathcal{S}\) is a set of subsets of \(\Omega\), \(\mathcal{S} \subseteq \textrm{pow}(\Omega)\), that satisfy three properties,
- unit: \(\emptyset \in \mathcal{S}\),
- closed under complements: if \(A \in \mathcal{S},\) then \(A^{\complement} \in \mathcal{S}\), and
- closed under countable unions: if \(A_n \in \mathcal{S}\) for \(n \in \mathbb{N},\) then \(\bigcup_{n=0}^N A_n \in \mathcal{S}\).
Example 1: The power set of any non-empty set forms a complete atomic boolean algebra under union, intersection, and complementation, and is hence a \(\sigma\)-algebra. The simplest \(\sigma\)-algebra has a singleton sample space \(\Omega = \{ 0 \}\), with measurable events \(\mathcal{S} = \left\{ \{\}, \{ 0 \} \right\}\). The simplest non-trivial \(\sigma\)-algebra has a binary sample space \(\Omega = \{ 0, 1 \}\), with measurable sets \(\mathcal{S} = \left\{ \{\}, \{ 0 \}, \{ 1 \}, \{ 0, 1 \} \right\}\).
Example 2: A second simple example can be defined over the real interval \((0, 1)\) as the smallest set including all of the open intervals \((a, b) \subset (0, 1)\) and closed under complements and countable unions. A \(\sigma\)-algebra defined by closing open sets this way is called a Borel algebra.
Example 3: If \(\Omega_1, \mathcal{S}_1\) and \(\Omega_2, \mathcal{S}_2\) are \(\sigma\)-algebras, then so is the product algebra \(\Omega_1 \times \Omega_2, \left\{ A_1 \times A_2 : A_1 \in \mathcal{S}_1 \ \textrm{and} \ A_2 \in \mathcal{S}_2 \right\}\).
Measures and probability spaces
A measure over \(\mathcal{S}\) is a function \(\Pr: \mathcal{S} \rightarrow [0, 1]\) such that for all \(A, B_n \in \mathcal{S},\)
- empty set zero probability: \(\Pr[\emptyset] = 0\),
- full set unit probability: \(\Pr[\Omega] = 1\),
- non-negativity: if \(A \in \mathcal{S}\), then \(\Pr[A] \geq 0\), and
- countable additivity: if \(A_n \in \mathcal{S}\) for \(n \in \mathbb{N}\) is a sequence of pairwise disjoint sets (i.e., \(A_i \cap A_j = \emptyset\) if \(i \neq j\)), then \[ \Pr\!\left[\bigcup_{n = 0}^\infty B_n\right] = \sum_{n=0}^\infty \Pr\!\left[B_n\right]. \]
An element \(A \in \mathcal{S}\) is said to be a measurable set. The sample space \(\Omega\) with a measure \(\Pr\) over the \(\sigma\)-algebra \(\mathcal{S}\) together make up a probability space \((\Omega, \mathcal{S}, \Pr)\).
Example 1, cont. Continuing Example 1 from above, we can define a measure over the \(\sigma\)-algebra with sample space \(\Omega = \{ 0, 1 \}\) and measurable sets \(\mathcal{S} = \left\{ \{\}, \{ 0 \}, \{ 1 \}, \{ 0, 1 \} \right\}\). This involves assigning probabilities to all of the events. By construction, we know that \[ \Pr[\{ \}] = 0 \qquad \Pr[\{ 0, 1 \}] = 1. \] By the countable additive property of measures, and the fact that events \(\{ 0 \}\) and \(\{ 1 \}\) are disjoint, it follows that \[ \Pr[\{ 0 \}] + \Pr[\{ 1 \}] = \Pr[\{ 0 \} \cup \{ 1 \}] = 1, \] and therefore \[ \Pr[\{ 0 \}] = 1 - \Pr[\{ 1 \}]. \] If we take the sample space to represent the boolean outcome of a football match (1 for the home team winning, 0 for the visiting team winning), then \(\Pr[\{ 1 \}]\) is the probability that the home team wins and \(\Pr[\{ 0 \}]\) is the probability that the visiting team wins.
Example 2, cont. Continuing example 2 from above, we can define a measure over intervals \((a, b) \subseteq (0, 1)\) by \[ \Pr[(a, b)] = b - a. \] This measure can then be extended to unions of non-intersecting intervals by countable additivity. Because \(A^\complement \cup A = \Omega\), it follows from countable additivity that \[ \Pr[A^{\complement}] = 1 - \Pr[A]. \] If \(a = b\), we have a singleton event \((a, a) = \{ a \}\) with \(\Pr[(a, a)] = 0\). That is, points do not have non-zero probability mass in continuous measures.
Example 3, cont. Suppose we have a product \(\sigma\)-algebra \((0, 1) \times (0, 1)\), i.e., an open unit square. We can define a basis for our measure in terms of events defined by open discs, \[ \textrm{D}([x \quad y], r) = \left\{ [x' \quad y'] : \sqrt{(x' - x)^2 + (y' - y)^2} < r \right\}, \] that fall within the unit square, i.e., \[ x - r, x+r, y - r, y + r \in (0, 1). \] For discs within the unit square, we define their probability measure by \[ \Pr[\textrm{D}([x \quad y], r)] = \pi \cdot r^2, \] Probabilities for other events are defined by countable additivity and complementation. In general, an event will have a probability equal to its area. Thus the entire sample space \(\Omega\) has a probability of 1 because it is a unit square with area is 1. This construction works with the volumes of balls in three dimensions and the hypervolumes of hyperballs in higher dimensions.
A.2 Joint and conditional probability
Joint probability
If \(A\) and \(B\) are events, we define the joint probability of \(A\) and \(B\), written \(\Pr[A, B]\), by \[ \Pr[A, B] = \Pr[A \cap B]. \] This notation extends to any number of events, with \(\Pr[A, B, C] = \Pr[A \cap B \cap C]\) and so on.
Conditional probability
If \(\Pr[B] \neq 0\), we define the conditional probability of \(A\) given \(B\) by \[ \Pr[A \mid B] = \frac{\Pr[A, B]}{\Pr[B]}. \]
Bayes’s rule
If \(\Pr[B] \neq 0\) and events \(A_0, A_1, A_2, \cdots\) partition the sample space (i.e., a mutually disjoint sequence of events that are pairwise disjoint and \(\bigcup_{n \in \mathbb{N}} A_n = \Omega\)), then \[\begin{align} \Pr[A \mid B] &= \frac{\Pr[B \mid A] \cdot \Pr[A]} {\Pr[B]} \\[4pt] &= \frac{\Pr[B \mid A] \cdot \Pr[A]} {\sum_{n \in \mathbb{N}} \Pr[B, A_n]} \\[4pt] &= \frac{\Pr[B \mid A] \cdot \Pr[A]} {\sum_{n \in \mathbb{N}} \Pr[B \mid A_n] \cdot \Pr[A_n]}. \end{align}\]
A.3 Independent events
There are several notions of independence in statistics, which are summarized below.
Independence
A pair of events \(A\) and \(B\) are said to be independent if \[ \Pr[A, B] = \Pr[A] \cdot \Pr[B]. \]
Conditional independence
A pair of events \(A\) and \(B\) are said to be conditionally independent given an event \(C\) if \[ \Pr[A, B \mid C] = \Pr[A \mid C] \cdot \Pr[B \mid C]. \]
Mutual independence
A sequence of events \(A_0, A_1, \ldots, A_N\) is said to be mutually independent if \(N = 1\), or if * \(\Pr\!\left[\bigcap_{n = 1}^N A_n\right] = \prod_{n=1}^N \Pr[A_n],\) and * every proper subsequence \(A_m, A_{m+1}, \ldots, A_{M}\) is mutually independent, where a proper subsequence is one in which \(m \neq 0\) or \(M \neq N\).
Conditional mutual independence is defined the same way.
A.4 Random variables, cdfs, pdfs, and pmfs
A random variable is a function \(X:\Omega \rightarrow \mathbb{R}\) where for every \(x \in \mathbb{R}\), the set \(\{ \omega : X(\omega) \leq x \}\) is measurable (i.e., in the \(\sigma\)-algebra of the underlying probability space). As a deterministic function, a random variable is neither a variable nor in and of itself random.
A random variable gets its randomness from the underlying probability space over which it operates. Specifically, a random variable is the kind of object that produces a real value for every possible way the world can be (i.e., for every \(\omega \in \Omega\)).
Regular functions can be naturally extended to operate on random variables by applying them elementwise. For example, \((X + Y):\Omega \rightarrow \mathbb{R}\) is a random variable whose value in a state \(\omega \in \Omega\) is given by \[ (X + Y)(\omega) = X(\omega) + Y(\omega). \]
Example 1, cont. Recall the measure over \(\Omega = \{ 0, 1 \}\) with \(\Pr[\{ 1 \}] = p\) and \(\Pr[\{ 0 \}] = 1 - p\). Let’s fix \(p = 0.3\) for concreteness. We can define a random variable \(X:\Omega \rightarrow \mathbb{R}\) by \(X(0) = -2\) and \(X(1) = 10\). This random variable represents the payoff of a 2 dollar bet placed on a successful outcome of \(X\) at 5:1 odds. Although the random variable \(X\) is just a deterministic function, its possible values are assigned probabilities by the measure, specifically, \(\Pr[X = -2] = 0.3\) abd \(\Pr[X = 10] = 0.7\).
Example 2, cont. Recall the measure over \(\Omega = (0, 1)\) that takes the probability of a subinterval to be equal to its length. We can define a random variable \(X:\Omega \rightarrow \mathbb{R}\) by setting \(X(\omega) = \log \omega\). We can also define a random variable \(Y\) over the same measure by setting \(Y(\omega) = 0\) if \(\omega < 0.2\) and \(Y(\omega) = 1\) otherwise. The variable \(X\) has uncountably many possible values, whereas \(Y\) only has finitely many.
Cumulative distribution function
Given a random variable \(X\), its cumulative distribution function (cdf) \(F_X:\mathbb{R} \rightarrow [0, 1]\) is defined for \(x \in \mathbb{R}\) by \[ F_X(x) = \Pr[X \leq x], \] where we read \(X \leq x\) as the event \(\{ \omega \in \Omega : X(\omega) \leq x \}\). The cdf of a random variable is well-defined because all of the relevant events \(X \leq x\) must be measurable in order for \(X\) to be a random variable.
Probability density function
If the cumulative distribution function \(F_X\) for a random variable \(X\) is continuous and differentiable, we define its probability density function (pdf) \(p_X:\mathbb{R} \rightarrow (0, \infty)\) by \[ p_X(x) = \frac{\textrm{d}}{\textrm{d}u} F_X(u) \bigg|_{u = x}, \] for \(x \in \mathbb{R}\).
import scipy as sp
# Define the standard normal distribution
mu = 0
sigma = 1
dist = sp.stats.norm(loc=mu, scale=sigma)
# Generate x values
x = np.linspace(-6, 6, 100)
# Generate y values for PDF and CDF
pdf_y = dist.pdf(x)
cdf_y = dist.cdf(x)
# Create a dataframe for PDF and CDF
df_pdf = pd.DataFrame({
'x': x,
'y': pdf_y,
'type': 'PDF p_X'
})
df_cdf = pd.DataFrame({
'x': x,
'y': cdf_y,
'type': 'CDF F_X'
})
# Concatenate the dataframes
df = pd.concat([df_pdf, df_cdf])
# Plot
mydraw(
pn.ggplot(df, pn.aes(x='x', y='y')) +
pn.geom_line() +
pn.facet_wrap('~ type', scales = 'free_y') +
pn.labs(x='x', y='value', title='cumulative distribution function and probability density function of X ~ normal(0, 1)') +
pn.theme(figure_size=(10, 4),
panel_spacing=0.5)
)With the plots next to each other, it’s easier to see how the pdf is the derivative of the cdf—the derivative is maximum at 0.5, and approaches zero as values for \(X\) move away from zero. The pdf is symmetric around zero because the cdf has the property that \[ F_X(x) = 1 - F_X(-x). \] The function \(1 - F_X(x)\) is known as the complementary cumulative distribution function and it’s used to improve numerical stability.
Probability mass function
If the cumulative distribution function \(F_X\) for a random variable \(X\) is discrete (in the sense of being made up only of step functions), then we define its probability mass function (pmf) \(p_X:\mathbb{N} \rightarrow (0, \infty)\) by \[ p_X(n) = p_X(n + 1) - p_X(n), \] for \(n \in \mathbb{N}\).
The support of a random variable \(X\) is the set of values it can take on, which is formally defined as the values for which the pdf or pmf is greater than 0, \[ \textrm{supp}(X) = \{ x : p_X(x) > 0 \}. \]
The characterization of random variables with pdfs and pmfs is incomplete because not all probability distributions can be classified as discrete or continuous. A random variable \(X\) can have a cdf with both continuous portions and jumps. For example, if we define a mixture of a discrete and continuous cdf (e.g., for a spike-and-slab prior in Bayesian statistics), we get a cdf that is neither discrete nor continuous.
support = np.arange(2, 13)
probs = [0, 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36]
cdf_values = np.cumsum(probs)
# Create a dataframe
df = pd.DataFrame({
'n': support,
'cdf': cdf_values[1:12],
'cdf_low': cdf_values[0:11],
})
plot1 = (pn.ggplot(df, pn.aes(x='n', y='cdf')) +
pn.geom_point(size=2) +
pn.scale_x_continuous(breaks=np.arange(2, 13)) +
pn.labs(x='n', y='Pr[X <= n]') +
pn.theme(strip_margin_x=5.0) +
pn.scale_x_continuous(limits=(1,13),
breaks=support,
expand=(0, 0)) +
pn.ggtitle('cdf'))
for n in range(0,12):
plot1 += pn.geom_segment(pn.aes(x = n + 1,
y = cdf_values[n],
xend = n + 2 - 0.075,
yend = cdf_values[n]))
plot1 += pn.geom_point(pn.aes(x='n', y='cdf_low'),
size=2, stroke=0.5, fill='none',
shape='o', color='black')
df2 = pd.DataFrame({
'n': support,
'probs': probs[1:]
})
plot2 = (pn.ggplot(df2, pn.aes(x='n', y='probs'))
+ pn.geom_bar(stat='identity', color='black', fill='white')
+ pn.scale_x_continuous(breaks=support)
+ pn.scale_y_continuous(breaks=np.arange(1, 7) / 36,
labels=['1/36', '2/36', '3/36',
'4/36', '5/36', '6/36'])
+ pn.ggtitle('pmf'))
g1 = pw.load_ggplot(plot1, figsize=(3.25, 3))
g2 = pw.load_ggplot(plot2, figsize=(3.25, 3))
g12 = (g1 | g2)
g12.savefig()A.5 Expectations
Expectation
If we have a (potentially multivariate) random variable \(Z\), we write \(\mathbb{E}[Z]\) for its expectation, which despite the fancy name, is just its average value, weighted by density, \[ \mathbb{E}[Z] = \int_{\textrm{supp}(Z)} z \cdot p_Z(z) \, \textrm{d}z. \] Discussing random variables presupposes a background probability space \((\Omega, \mathcal{S}, \Pr)\) with respect to which the cumulative distribution function \(F_Z\) and the probability density function \(p_Z\) are defined.
Statistical notation overloads many notations, and commonly just \(Z\) rather than \(\textrm{supp}(Z)\) is used to subscript the integral.
Conditional expectation
With Bayesian statistics, we are typically interested in conditional expectations, which are the value of a random variable conditioned on the observed value of a second random variable. Suppose \(Y\) is a second random variable over the same implicit probability space and we observe that \(Y = y\). The conditional expectation of \(Z\) given \(Y = y\) is \[ \mathbb{E}[Z \mid Y = y] = \int_{Z} z \cdot p_{Z \mid Y}(z \mid y) \, \textrm{d}z. \] That is, we take a weighted average of the value of \(Z\) with weights determined by the conditional density \(p_{Z \mid Y}(z \mid y)\).
Expectation of functions of random variables
Suppose we have a random variable \(Z\) and a real-valued function \(f:\textrm{supp}(Z) \rightarrow \mathbb{R}^N\). We can then define a new random variable \(W = f(Z)\). To calculate the expectation of \(W\) in terms of the density for \(Z\), we use the so-called law of the unconscious statistician, which tells us we can simplify the change of variables back to a direct average over \(Z\), \[\begin{align} \mathbb{E}[W] &= \int_W w \cdot p_W(w) \, \textrm{d}w. \\[4pt] &= \int_W w \cdot p_Z(f^{-1}(w)) \cdot (f^{-1})'(w) \, \textrm{d}w. \\[4pt] &= \int_Z f(z) \cdot p_Z(z) \, \textrm{d}z, \end{align}\] where we have written \(f^{-1}\) for the inverse transfrom form \(W\) back to \(Z\) and \((f^{-1})'\) for its derivative.
We define the expectation of a function applied to \(Z\), such as \(f(Z)\), as \[ \mathbb{E}[W] = \mathbb{E}[f(Z)] = \int_Z f(z) \cdot p_Z(z) \, \textrm{d}z. \] The conditional form follows suit, \[ \mathbb{E}[f(Z) \mid Y = y] = \int_Z f(z) \cdot p_{Z \mid Y}(z \mid y) \, \textrm{d}z. \] Either way, we just take a weighted average the value of \(f(z)\) with weights \(p_Z(z)\) or \(p_{Z|Y}(z \mid y).\)
Variance and standard deviation
The variance of a random variable is the expectation of its squared difference from the expectation, \[ \textrm{var}[X] = \mathbb{E}\!\left[ \left( X - \mathbb{E}[X] \right)^2 \right]. \] The standard deviation is the square root of the variance, \[ \textrm{sd}[X] = \sqrt{\textrm{var}[X]}. \]
The units of the standard deviation are identical to those of the random variable \(X\). In contrast, the units of the variance are the square of the units of \(X\). For example, if \(X\) represents a distance measured in meters, then the standard deviation \(\textrm{sd}[X]\) is also expressed in meters. On the other hand, the variance \(\textrm{var}[X]\) is expressed in square meters.
A.5 Quantiles, medians, and uncertainty intervals
Suppose \(Z\) is a univariate random variable (e.g., one of the parameters, \(\Theta_d\)). If \(p \in [0, 1]\), the \(p\)-quantile for a univariate random variable \(Z\), \(\textrm{quantile}(Z, p)\), is defined by \[ \textrm{quantile}(Z, p) = z \ \textrm{ if and only if } \ \textrm{Pr}[Z \leq z] = p. \] That is, the \(p\)-quantile of \(Z\) is the value \(z\) such that the probability \(Z\) is less than or equal to \(z\) is \(p.\) In Bayesian applications, we will typically be taking the probability for functions of parameters \(Z = f(\Theta)\) and conditioning on observed data \(y\).
The 0.5-quantile is known as the median. There is a 50% chance that a random variable takes on a value less than its median and a 50% chance that it takes on a value greater. Another common Bayesian estimator for parameter \(\Theta\) is its posterior median, \[ \overline{\theta} = \textrm{quantile}(Z, 0.5). \] The median estimator has the property of minimizing expected absolute error (in contrast to the posterior mean estimate, which minimizes expected squared error), again assuming the model represents the true data generating process.
If we have two probabilities, \(0 \leq p^L \leq p^U \leq 1\), they define the \((p^L, p^U)\) posterior interval \[ \textrm{interval}(Z, p^L, p^U) = \left( \textrm{quantile}(Z, p^L), \textrm{quantile}(Z, p^U) \right). \]
The probability that a point falls in the interval is given by the differences in probabilities, \[ \textrm{Pr}\!\left[Z \in \textrm{interval}(Z, p^L, p^U)\right] = p^U - p^L. \] If \(1 - p^U = p^L\), the interval is called the central \((p^L - p^U)\) interval. For example, the central 95% interval for \(Z\) is \(\left( \textrm{quantile}(Z, 0.025), \textrm{quantile}(Z, 0.975) \right)\).
B. Bayesian statistics
In this appendix, we provide a concise mathematical overview of Bayesian statistics. Bayesian statistics is largely concerned with estimating quantities of interest based on a joint probability model of the data and parameters, given some observed data. It is also concerned with quantifying our uncertainty in these estimates. Quantities of interest are typically parameter values and deviations, event probability estimates, and forecasts/backcasts.
B.1 Bayes’s theorem
Bayes (1763) formalized his approach in the following theorem which was named after him. The basic idea is this, if we have parameters and data, we can calculate the distribution of the parameters given the data knowing on the distribution of the data given the parameters and the prior. Using names, the first step of Bayes’s theorem for densities is \[\begin{align} \underbrace{p(\textrm{params} \mid \textrm{data})}_{\textrm{posterior}} &= \underbrace{p(\textrm{data} \mid \textrm{params})}_{\textrm{sampling}} \cdot \underbrace{p(\textrm{params})}_{\textrm{prior}} \, / \, \underbrace{p(\textrm{data})}_{\textrm{evidence}}. \\[4pt] \end{align}\] In general discussions, we typically use the variable \(y\) for data and \(\theta\) for model parameters. The second step of Bayes’s theorem replaces the evidence by marginalizing data out of the numerator.
Theorem 1 (Bayes’s Theorem) Given a joint density \(p(y, \theta)\), the posterior density \(p(\theta \mid y)\) can be defined in terms that only involve the prior \(p(\theta)\) and sampling distribution \(p(y \mid \theta)\), as \[ p(\theta \mid y) \ = \ \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}. \]
Proof: \[\begin{align} p(\theta \mid y) &= \frac{p(y, \theta)} {p(y)} & \textrm{[definition of conditional probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {p(y)} & \textrm{[chain rule]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y, \theta) \, \textrm{d}\theta} & \textrm{[law of total probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}. & \textrm{[chain rule]} \end{align}\] \(\blacksquare\)
Bayes’s theorem allows us to solve the so-called inverse problem of inferring the posterior \(p(\theta \mid y)\) when all we have is the sampling distribution \(p(y \mid \theta),\) the prior \(p(\theta)\), and data \(y\).
In most casses, Stan programs do not require normalized log densities. This allows us to go a step further and drop the denominator \(p(y)\), which does not depend on \(\theta\) and write Bayes’s rule up to a proportion as \[ p(\theta \mid y) \ \propto \ p(y \mid \theta) \cdot p(\theta). \]
This lets us proceed with only an unnormalized sampling density \(p(y \mid \theta)\) and unnormalized prior \(p(\theta)\). About the only time that a normalizing constant is required is for comparing two models prior predictive distributions, which is not something we would recommend doing.
B.2 Bayesian inference
Bayesian inference is largely about estimating quantities of interest based on a probability model and observed data. In typical applied Bayesian inference problems, we are interested in three quantities that can be expressed as expectations: parameter estimates, event probabilities, and probabilistic prediction. All of these quantities are expressed as expectations over the posterior, meaning that they involve averaging over our uncertainty in parameter values.
We are also interested in uncertainty, which is defined by the posterior. We typically summarize uncertainty using quantiles, and in particular, posterior intervals (sometimes called “credible intervals”).
Parameter estimation
The first quantity of interest is the value of parameters. The standard Bayesian parameter estimate is the posterior mean, or conditional expectation given the data. Given a model \(p(y, \theta)\) and observed data \(y\), the Bayesian posterior mean estimate of the parameters \(\theta\) is
\[\begin{align} \widehat{\theta} &= \mathbb{E}[\Theta \mid y] \\[6pt] &= \int_{\Theta} \theta \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]
The posterior mean is the value that minimizes expected square error in the estimates if the model is well specified in the sense of representing the true data-generating process. Squared error is the squared \(\textrm{L}_2\) norm of the difference between the estimate and the true parameter values. We can expand these definitions down to basic form.
\[\begin{align} \widehat{\theta} &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. \left|\left| \, u - \Theta \, \right|\right|^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. (u - \Theta)^{\top} \cdot (u - \Theta) \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. \sum_{d=1}^D \, (u_d - \Theta_d)^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \sum_{d=1}^D \, \mathbb{E}\!\left[ \left. (u_d - \Theta_d)^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \sum_{d=1}^D \, \int_{\Theta} (u_d - \theta_d)^2 \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]
It’s clear from the final form that the estimate \(\widehat{\theta}\) is determined by averaging over posterior uncertainty.
Event probabilities
An event in statistics is a subset of the parameter space, \(A \subseteq \Theta\), where \(\Theta\) is the set of all valid parameter values. We usually pick out events using conditions on the parameter space. For example, the condition \(\theta > 0.5\) defines the event \(A = \left\{ \theta \in \Theta : \theta > 0.5 \right\}\).
The probability of an event conditioned on data is just another posterior expectation, this time of \[\begin{align} \textrm{Pr}[A \mid y] &= \mathbb{E}[\textrm{I}[\Theta \in A] \mid y] \\[6pt] &= \int_{\Theta} \textrm{I}(\theta \in A) \cdot p(\theta \mid y) \, \textrm{d}\theta, \end{align}\] where \(\textrm{I}\) is a function defined over boolean values. The expression \(\textrm{I}(\theta \in A)\) takes on value 1 if \(\theta \in A\) and value 0 otherwise. We write \(\textrm{I}[\Theta \in A]\) using square brackets because a random variable is a function, whereas we write \(\textrm{I}(\theta \in A)\) because \(\theta \in \mathbb{R}^D\) is a value; see the Appendix section on random variables.
Posterior predictive inference
Often we are interested in predicting new data \(\tilde{y}\) given the observation of existing data \(y\). This is a form of posterior predictive inference. For example, \(y\) might be the price of a stock over some time period and \(\tilde{y}\) the price of the stock in the future. Or \(y\) might be the result of past games and \(\tilde{y}\) the winner of tomorrow’s game. Posterior predictive inference is also cast an expectation, this time of a density.
\[\begin{align} p(\tilde{y} \mid y) &= \mathbb{E}\!\left[ \, p(\tilde{y} \mid \Theta) \mid y \, \right] \\[6pt] &= \int_{\Theta} p(\tilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]
C. Markov chain Monte Carlo
Stan performs asymptotically exact, full Bayesian inference using Markov chain Monte Carlo (MCMC). MCMC starts from an initial value \(\theta^{(0)}\), which can be random, user provided, or a mixture of both, and then generates a sequence of values \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(m)}, \ldots\) forming a Markov chain. A sequence forms a Markov chain if each value is generated conditioned on only the previous value, so that \[ p(\theta^{(m + 1)} \mid \theta^{(1)}, \ldots, \theta^{(M)}) = p(\theta^{(m + 1)} \mid \theta^{(m)}). \] We call the distribution \(p(\theta^{(m + 1)} \mid \theta^{(m)})\) the transition kernel of the Markov chain. Because of this dependency, the sequence of random variables making up the Markov chain exhibit a degree of autocorrelation, meaning that \(\theta^{(m)}\) is correlated with \(\theta^{(m+1)}\).
We typically assume some mild conditions on the Markov chain transition transition kernel. First, it must be ergodic in the sense of never getting stuck in such a way it can’t visit the rest of the parameter space. Second, it must be aperiodic in the sense of not cycling regularly through disjoint regions of the parameter space. Third, it needs to preserve the target density in the sense that if \(\theta^{(m)} \sim p(\theta \mid y)\) has the posterior as its marginal distribution, then \(\theta^{(m+1)} \sim p(\theta \mid y)\) also follows the posterior distribution.
Suppose the initial value \(\theta^{(0)}\) is drawn from the posterior, \[ \theta^{(0)} \sim p(\theta \mid y). \] The notation here indicates that we apply random sampling to draw a value for \(\theta\) from the distribution \(p(\theta \mid y)\) (for known data \(y\)) and assign it to \(\theta^{(0)}\). With the initial state drawn from the posterior, each value in the Markov chain will be marginally distributed according to the posterior, which we write as \[ \theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid y). \]
Typically we cannot draw an initialization from the posterior—if we could, we’d just use simpler Monte Carlo methods without the Markov chain dependencies. In the situation where \(\theta^{(0)}\) is not a draw from the posterior \(p(\theta \mid y)\), then as the chain progresses, it will converge to the right distribution in the sense that in the limit as \(m \rightarrow \infty\), the draws approach the corect distribution, so that \[ \lim_{m \rightarrow \infty} p(\theta^{(m+1)} \mid \theta^{(m)}) = p(\theta^{(m+1)} \mid y). \]
Suppose we have a posterior sample of draws \(\theta^{(1)}, \ldots, \theta^{(M)}\). With this sample, we can estimate expectations by simply plugging in values of \(\theta^{(m)}\) and averaging. If our chain satisfies the mild conditions above, we know that we get the right answer asymptotically, \[\begin{align} \mathbb{E}\!\left[f(\theta) \mid y \right] &= \int_{\Theta} f(\theta) \cdot p(\theta \mid y) \, \textrm{d}\theta \\[6pt] &= \lim_{M \rightarrow \infty} \ \frac{1}{M} \sum_{m=1}^M f\!\left( \theta^{(m)} \right). \end{align}\] With only finite time in practice, we use the initial segment of the chain to estimate expectations, \[ \mathbb{E}\!\left[f(\theta) \mid y\right] \approx \frac{1}{M} \sum_{m=1}^M f(\theta^{(m)}). \]
If our initial draw is from the posterior, this estimate is unbiased. In the usual situation, where the initial draw is not from the posterior, our expectation retains a degree of bias. Nevertheless, the limit above shows that the bias goes to zero in the limit. In practice, we typically remove an initial segment of \(N\) draws before the chain has approximately converged to the right distribution, and average the remaining draws, \[ \mathbb{E}\!\left[f(\theta) \mid y\right] \approx \frac{1}{M - N} \sum_{m=N + 1}^M f(\theta^{(m)}). \]
Given an estimate \(\widehat{\theta}\), its error is \(\widehat{\theta} - \theta\). If its expected error is zero, an estimator is said to be unbiased. If the Monte Carlo draws \(\theta^{(1)}, \ldots, \theta^{(M)}\) are independent, the central limit theorem tells us that our error follows a normal distribution asymptotically, so that as \(M \rightarrow \infty\), the error of our estimate \(\widehat{\theta}\) follows a normal distribution, \[ \widehat{\theta} - \theta \sim \textrm{normal}\!\left(0, \frac{\sigma}{\sqrt{M}}\right), \] where \(\sigma\) is the standard deviation of the variable \(\theta\). Pre-asymptotically, this result holds approximately and will be used to estimate the distribution of estimation errors.
Because draws in MCMC can be correlated (or even anti-correlated), we need the MCMC central limit theorem to generalize the central limit theorem to the case of correlated variables; Geyer (2011) and Roberts and Rosenthal (2004) provide precise definitions of the theorem.
Here, we can estimate an effective sample size (ESS), which is the number of independent draws that provide the same error distribution. If \(M^{\textrm{eff}}\) is the effective sample size of the Markov chain \(\theta^{(1)}, \ldots, \theta^{(M)}\), then the error will be distributed approximately as
\[ \widehat{\theta} - \theta \sim \textrm{normal}\!\left(0, \frac{\sigma}{\sqrt{M^{\textrm{eff}}}}\right). \]
D. Installed packages
The system and operating system are as follows.
import sys
import platform
print("\nSystem")
print(sys.version)
print("\nOperating System")
print(f""" {platform.system() = }
{platform.release() = }
{platform.version = }""")
System
3.11.3 (main, Apr 7 2023, 19:25:52) [Clang 14.0.0 (clang-1400.0.29.202)]
Operating System
platform.system() = 'Darwin'
platform.release() = '22.3.0'
platform.version = <function version at 0x103baf1a0>
The installed packages (i.e., the working set) is as follows.
import pkg_resources
print("\nInstalled packages:")
for dist in pkg_resources.working_set:
print(dist)
Installed packages:
Jinja2 3.1.2
MarkupSafe 2.1.2
Pillow 9.5.0
PyYAML 6.0
Pygments 2.15.1
QtPy 2.3.1
Send2Trash 1.8.2
anyio 3.6.2
appnope 0.1.3
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
attrs 23.1.0
backcall 0.2.0
beautifulsoup4 4.12.2
bleach 6.0.0
cffi 1.15.1
cmdstanpy 1.1.0
comm 0.1.3
contourpy 1.0.7
cycler 0.11.0
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.6
executing 1.2.0
fastjsonschema 2.16.3
fonttools 4.39.3
fqdn 1.5.1
idna 3.4
ipykernel 6.22.0
ipython 8.13.1
ipython-genutils 0.2.0
ipywidgets 8.0.6
isoduration 20.11.0
jedi 0.18.2
jsonpointer 2.3
jsonschema 4.17.3
jupyter 1.0.0
jupyter-client 8.2.0
jupyter-console 6.6.3
jupyter-core 5.3.0
jupyter-events 0.6.3
jupyter-server 2.5.0
jupyter-server-terminals 0.4.4
jupyterlab-pygments 0.2.2
jupyterlab-widgets 3.0.7
kiwisolver 1.4.4
matplotlib 3.7.1
matplotlib-inline 0.1.6
mistune 2.0.5
mizani 0.9.0
nbclassic 0.5.6
nbclient 0.7.4
nbconvert 7.3.1
nbformat 5.8.0
nest-asyncio 1.5.6
notebook 6.5.4
notebook-shim 0.2.3
numpy 1.24.3
packaging 23.1
pandas 2.0.1
pandocfilters 1.5.0
parso 0.8.3
patchworklib 0.6.0
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
pip 23.1.2
platformdirs 3.5.0
plotnine 0.10.1
prometheus-client 0.16.0
prompt-toolkit 3.0.38
protobuf 4.21.12
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pycparser 2.21
pyparsing 3.0.9
pyrsistent 0.19.3
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.3
pyzmq 25.0.2
qtconsole 5.4.2
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
scipy 1.10.1
setuptools 67.6.1
six 1.16.0
sniffio 1.3.0
soupsieve 2.4.1
stack-data 0.6.2
statsmodels 0.13.5
terminado 0.17.1
tinycss2 1.2.1
tornado 6.3.1
tqdm 4.65.0
traitlets 5.9.0
tzdata 2023.3
uri-template 1.2.0
wcwidth 0.2.6
webcolors 1.13
webencodings 0.5.1
websocket-client 1.5.1
wheel 0.40.0
widgetsnbextension 4.0.7
References
Footnotes
The statistical sampling literature often overloads “sample” to mean both a sample and a draw.↩︎